REMARKS 



I. Explanation of Amendments and Interview Summary 

The Applicants acknowledge with thanks the courtesy extended by the 
Examiner to the Applicants' attorney, David A. Gass, during a personal interview on August 2, 
2007, during which time the rejections in the outstanding Office action were discussed. The 
Applicants proposed the presentation of a meta-analysis of data pertaining to the invention 
and the Examiner encouraged the Applicants to present such an analysis in the form of a Rule 
132 declaration. 

The claims were amended to remove reference to "stroke" in order to expedite 
allowance of a preferred embodiment, and not for reasons related to patentability. 

The final clause of claim 61, pertaining to interpretation of the screening 
results where the haplotype of interest is absent, was amended to clarify that "the absence of 
the haplotype in the nucleic acid of the individual identifies the individual as not having the 
elevated susceptibility to MI due to the haplotype." It will be self-evident that any screening 
test leads to one conclusion if the test is positive and another conclusion if the test is negative. 
The final clause of the claim as originally presented refers to the relative risk from the tested 
haplotype and a person of ordinary skill would have interpreted the claim as drawing a 
conclusion that the individual has no elevated susceptibility to MI from other, untested 
factors. That, however, appears to be a conclusion drawn by the Patent Office, resulting in a 
rejection alleging lack of enablement. The current amendment further clarifies a conclusion 
that would be reached for a subject that is found not to carry the haplotype. Any further fine- 
tuning of this clause should be amenable to resolution by telephonic interview because the 
Applicants and the Patent Office appear to intend the same scope and meaning for this 
element of the claim. 

II. Remarks Relating to the Nature of the Invention 

The Applicants continue to dispute the Patent Office's characterization of the 
claimed invention. The invention relates to a method for assessing susceptibility to 
myocardial infarction that involves analysis of nucleic acid sequence in a person's FLAP 
gene. The elected claims are not drawn to polynucleotides, to polymorphisms, or to 
"differences," even though the Patent Office's patentability search may involve looking for 
such features in the prior art. Rather, the claims are drawn to methods that involve analyzing 
a human individual's DNA at a particular locus. The results of the analysis determine 
whether or not the individual is scored as having elevated risk for myocardial infarction. 
There is common utility for all variations of this method that are described in the application. 

III. The Rejection Under 35 U.S.C. § 112, First Paragraph, Alleging Lack of 
Enablement Should be Withdrawn 

In paragraph 6 of the Office action, the Patent Office rejected claims 61 and 
63-66, alleging lack of enabling disclosure. The Applicants traverse this rejection. 

The Applicants repeat by reference arguments made in their previous 
submissions. 

The rejection was based in part on alleged overbreadth insofar as the claims 
encompassed assessing susceptibility to both MI and stroke. Reference to stroke has been 
deleted by amendment, rendering moot this basis for rejection. 

Most of the remaining discussion of the issue of enabling disclosure in the 
Office action focuses on whether or not the association between FLAP haplotypes and MI 
taught in the application is reproducible. Accompanying this amendment is a sworn 
declaration summarizing a meta-analysis conducted by deCODE genetics, the assignee of this 
application. The meta-analysis shows that the correlation between FLAP haplotypes and MI 
is indeed reproducible. 

Importantly, the meta-analysis includes data from the Zee study cited by the 
Examiner as well as other published studies with available data analyzing the correlation 
between the FLAP haplotype and MI. The meta-analysis is an aggregation of data from 
smaller studies and shows that the correlation between the FLAP haplotype and increased 
risk for MI is real and statistically significant. The statistical power of the meta-analysis is 
much greater due to the large sample size, and the meta-analysis is much more probative than 
any individual smaller study that did not necessarily detect the correlation due to small sample size. 

The Patent Office analyzes various articles or studies that pertain in some 
fashion to gene-disease correlation, but that are unrelated to the subject matter of the claims 
(e.g., Mayer et al., SNPs in the CADPKL gene and neurological disorders). These studies 
have no probative value with respect to FLAP-MI correlation, especially in view of the 
abundant data now available pertaining to FLAP-MI. The Applicants pointed to numerous 
defects in articles cited by the Patent Office or their failure to support the proposition for 
which they were cited, yet the PTO has continued to rely on the articles without addressing 
the defects. 

The Patent Office alleges (section titled "Guidance in the Specification") that 
the specification provides no evidence that the invention can be practiced "as broadly 
claimed" with respect to individual markers. This aspect of the Office action has no 
relevance to the elected claims, which pertain to a particular four-polymorphism haplotype. 
The haplotype of the claim is shown to have a statistically significant correlation with 
increased risk of MI in the application and in the meta-analysis referred to above. 

The Patent Office raised concerns about the proper wording of the claims as 
they pertain to individuals that do not have the tested-for HapA FLAP haplotype. These 
concerns are addressed above, and do not give rise to any questions of enabling disclosure. 

The Patent Office's final concern appears to relate to ethnic or "inter-ethnic" 
variability conferring different risks. Even if true, such variability does not give rise to 
questions of statutory enablement for the present claims. Statutory enablement involves 
whether an application describes an invention in a manner that allows those of ordinary skill 
to practice the invention. The present application teaches a person of ordinary skill how to 
perform the haplotype screen without regard to ethnicity, and teaches the conclusion that can 
reasonably be drawn from it based on population genetics. As with other correlation tests, 
the results provide helpful information for medical treatment or lifestyle management, and 
are indicative of risk at the population level. The present invention is appropriately claimed 
insofar as an individual is assessed for one type of data (FLAP haplotype) and a conclusion 
about susceptibility (supported by statistically validated data) is drawn based on the FLAP 
haplotype assessment only. The conclusion does not require ethnicity data. 

Human variability is the rule, not the exception, for all aspects of medicine, 
including diagnostic tests based on biochemistry; safety of drugs; efficacy of drugs; 
susceptibility to diseases; life expectancy, and so on. While it may be possible or desirable to 
refine any medical test or treatment or other medical procedure to an ethnicity or sub- 
ethnicity, that is not the current state of medicine and is not part of the statutory requirement 
of enablement for a claim that does not require a conclusion based on ethnicity. The 
Applicants' previous amendment cited many examples of diagnostic tests that are considered 
medically useful, even though their predictive value with respect to any particular person is 
not considered a certainty. The data in the application and the larger meta-analysis show that 
the test is valid and useful and provides another tool for assessing risk for MI. 



The Patent Office alleges that "as a general rule of thumb" the field looks for a 
relative risk of three or more before accepting a paper for publication. As explained in 
greater detail in the declaration filed herewith, RR > 3 is clearly NOT the standard in the field 
for accepting publications. (The inventor's paper was published in the prestigious journal 
Nature Genetics without RR > 3.) Nor is RR > 3 descriptive of the risk that would reasonably be 
attributed to multi-factorial diseases. Nor is RR > 3 a relevant indicator for statutory 
enablement. A conclusion of enablement is appropriate in view of the fact that the 
incremental risk for MI associated with FLAP HapA, though not nearly as high as 3.0, has 
been shown through large studies and meta-analysis of multiple studies to be statistically 
significant, and not an artifact of a small study. 

The Patent Office observed that the Helgadottir declaration inadvertently used 
the term "co-inventor" when "inventor" should have been used. The Applicants apologize 
for any confusion caused by this typographical error. In addition, the filed Helgadottir 
declaration omitted Exhibit G, which consists of the references Falk and Rubinstein and 
Terwilliger and Ott. These references are submitted herewith as Appendix B. 

CONCLUSION 

In view of the foregoing amendment and remarks, Applicants believe pending 
claims 61-66 are in condition for allowance and early notice thereof is solicited. 



Dated: October 16, 2007 




Registration No.: 48^84 
MARSHALL, GERSTEIN & BORUN LLP 
233 S. Wacker Drive, Suite 6300 
Sears Tower 

Chicago, Illinois 60606-6357 
(312) 474-6300 
Attorney for Applicant 

The gene encoding 5-lipoxygenase activating protein 
confers risk of myocardial infarction and stroke 

Anna Helgadottir 1, Andrei Manolescu 1, Gudmar Thorleifsson 1, Solveig Gretarsdottir 1, Helga Jonsdottir 1, 
Unnur Thorsteinsdottir 1, Nilesh J Samani 2, Gudmundur Gudmundsson 1, Struan F A Grant 1, 
Gudmundur Thorgeirsson 3, Sigurlaug Sveinbjornsdottir 3, Einar M Valdimarsson 3, Stefan E Matthiasson 3, 
Halldor Johannsson 3, Olof Gudmundsdottir 1, Mark E Gurney 1, Jesus Sainz 1, Margret Thorhallsdottir 1, 
Margret Andresdottir 1, Michael L Frigge 1, Eric J Topol 4, Augustine Kong 1, Vilmundur Gudnason 5, 
Hakon Hakonarson 1, Jeffrey R Gulcher 1 & Kari Stefansson 1 

We mapped a gene predisposing to myocardial infarction to a locus on chromosome 13q12-13. A four-marker single-nucleotide 
polymorphism (SNP) haplotype in this locus spanning the gene ALOX5AP encoding 5-lipoxygenase activating protein (FLAP) is 
associated with a two times greater risk of myocardial infarction in Iceland. This haplotype also confers almost two times greater 
risk of stroke. Another ALOX5AP haplotype is associated with myocardial infarction in individuals from the UK. Stimulated 
neutrophils from individuals with myocardial infarction produce more leukotriene B4, a key product in the 5-lipoxygenase 
pathway, than do neutrophils from controls, and this difference is largely attributed to cells from males who carry the at-risk 
haplotype. We conclude that variants of ALOX5AP are involved in the pathogenesis of both myocardial infarction and stroke by 
increasing leukotriene production and inflammation in the arterial wall. 

Cardiovascular diseases (CVD) are the leading causes of death and dis- 
ability in the developed world 1, with an increasing prevalence due to 
the aging of the population and the obesity epidemic. More than 
1 million deaths in the US alone were caused by myocardial infarction 
and stroke in 2003 (ref. 2). Some of the processes underlying myocar- 
dial infarction are now understood: it is generally attributed to athero- 
sclerosis with arterial wall inflammation that ultimately leads to 
plaque rupture, fissure or erosion 3,4 . This process is known to involve 
diapedesis of monocytes across the endothelial barrier; activation of 
neutrophils, macrophage cells and platelets; and release of a variety of 
cytokines and chemokines 5,6, but the genetic basis of the process has 
not yet been deciphered. 

Two different approaches have been used to search for genes associ- 
ated with myocardial infarction. SNPs in candidate genes have been 
tested for association and have, in general, not been replicated or con- 
fer only a modest risk of myocardial infarction. Case-control associa- 
tion studies have identified several proinflammatory genes with 
variants that are associated with either an increased risk of myocardial 
infarction or a protective effect 7-9. Four genome-wide scans in families 
with myocardial infarction have yielded several loci with formidable 
linkage peaks, but the gene(s) underlying these loci have not yet been 
identified 10-14. In addition, one large pedigree study identified a 
deletion mutation of a transcription factor gene, MEF2A, with autosomal 
dominant transmission 14 . This is an interesting cause of myocardial 
infarction, but the prevalence of this or other mutations in MEF2A 
outside this family remains to be determined. 

Here we report a genome-wide scan of 296 multiplex Icelandic 
families including 713 individuals with myocardial infarction. 
Through suggestive linkage to a locus on chromosome 13q12-13, we 
identified the gene (ALOX5AP) encoding FLAP and found that a 
four-SNP haplotype in the gene confers a nearly two times greater 
risk of myocardial infarction and stroke. FLAP is a regulator 15 of a 
crucial pathway in the genesis of leukotriene inflammatory media- 
tors, which are implicated in atherosclerosis both in a mouse 
model 16 and in human studies 17,18 . Males had the strongest associa- 
tion to the at-risk haplotype, and male carriers of the at-risk haplo- 
type also had significantly greater production of leukotriene-B4 
(LTB4), supporting the idea that proinflammatory activity has a role 
in the pathogenesis of myocardial infarction. We confirmed the asso- 
ciation of ALOX5AP with myocardial infarction in an independent 
cohort of British individuals with another haplotype. These results 
indicate that ALOX5AP is the first specific gene isolated that confers 
substantial population-attributable risk (PAR) of the complex traits 
of both myocardial infarction and stroke. 



1 deCODE genetics, Sturlugata 8, Reykjavik, Iceland. 2 Department of Cardiovascular Sciences, University of Leicester, Glenfield Hospital, Leicester, UK. 3 National 
University Hospital, Reykjavik, Iceland. 4 Cleveland Clinic Foundation, Cleveland, Ohio, USA. 5 Icelandic Heart Association, Reykjavik, Iceland. Correspondence should 
be addressed to K.S. (kstefans@decode.is). 

RESULTS 
Linkage analysis 

We carried out a genome-wide scan in search of myocardial infarction 
susceptibility genes using a framework set of 1,068 microsatellite 
markers. The initial linkage analysis included 713 individuals with 
myocardial infarction who fulfilled the World Health Organization 
(WHO) MONICA research criteria 19 and were clustered in 296 
extended families. We repeated the linkage analysis for individuals 
with early onset, for males and for females separately. A description of 
the number of affected individuals and families in each analysis is 
provided in Supplementary Table 1 online, and the corresponding 
allele-sharing lod scores are given in Supplementary Figure 1 online. 
None of these analyses yielded a locus of genome-wide significance. 
The most promising lod score (2.86) was observed on chromosome 
13q12-13 for linkage with females with myocardial infarction at the 
peak marker D13S289 (Supplementary Fig. 1 online). This locus also 
had the most promising lod score (2.03) for individuals with early- 
onset myocardial infarction. After we increased the information on 
identity-by-descent sharing to over 90% by typing an additional 14 
microsatellite markers in a 30-cM region around D13S289, the lod 
score for the association in females dropped to 2.48 (P = 0.00036), 
and the lod score remained highest at D13S289 (Fig. 1a). In an inde- 
pendent linkage study of males with ischemic stroke or transient 
ischemic attack (TIA), we observed linkage to the same locus with a 
lod score of 1.51 at the same peak marker (Supplementary Fig. 2 
online), further suggesting that a cardiovascular susceptibility factor 
might reside at this locus. 
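The correspondence between a lod score and its P value quoted above can be illustrated with a short calculation. The sketch below is not taken from the paper's methods; it assumes the common large-sample approximation for a one-parameter allele-sharing lod score, in which 2 ln(10) x lod is treated as the square of a standard normal deviate and the test is one-sided. The function name `lod_to_p` is hypothetical.

```python
import math


def lod_to_p(lod: float) -> float:
    """Approximate one-sided P value for a single-parameter lod score.

    Assumption (not stated in the text): 2*ln(10)*lod is treated as the
    square of a standard normal deviate z, and P = P(Z > z).
    """
    if lod < 0:
        raise ValueError("lod score must be non-negative for this approximation")
    z = math.sqrt(2.0 * math.log(10.0) * lod)
    # Upper tail of the standard normal distribution via the complementary error function.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


if __name__ == "__main__":
    # lod scores quoted in the linkage analysis above
    for lod in (2.86, 2.48, 2.03, 1.51):
        print(f"lod {lod:.2f} -> approximate P = {lod_to_p(lod):.5f}")
```

Under this assumption a lod score of 2.48 maps to a P value of about 0.00036, in line with the value reported in the text.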

Microsatellite association study 

The 7.6-Mb region that corresponds to a drop of 1 in lod score in the 
female-myocardial infarction linkage analysis contains 40 known 
genes (Supplementary Table 2 online). To determine which gene in 



Figure 1 Schematic view of the chromosome 13 linkage region showing 
ALOX5AP. (a) The linkage scan for females with myocardial infarction and 
the one-lod drop region that includes ALOX5AP. (b) Microsatellite 
association for all individuals with myocardial infarction: single-marker 
association (black dots) and two-, three-, four- and five-marker haplotype 
association (black, blue, green and red horizontal lines, respectively). The 
blue and red arrows indicate the location of the most significant haplotype 
association across ALOX5AP in males and females, respectively. (c) 
ALOX5AP gene structure, with exons shown as colored cylinders, and the 
locations of all SNPs typed in the region. The green vertical lines indicate 
the position of the microsatellites (b) and SNPs (c) used in the analysis. 



this region was most likely to contribute to myocardial infarction, we 
typed 120 microsatellite markers in the region and carried out a case- 
control association study using 802 unrelated (separated by at least 
three meioses) individuals with myocardial infarction and 837 popu- 
lation-based controls. We also repeated the association study for each 
of the three phenotypes that were used in the linkage study: individu- 
als with early onset, males and females with myocardial infarction. In 
addition to testing each marker individually, we also tested haplo- 
types based on these markers for association. To limit the number of 
haplotypes tested, we considered only haplotypes spanning less than 
300 kb that were over-represented among the affected individuals. 

The haplotype with the strongest association to myocardial infarc- 
tion (P = 0.00004) covered a region that contains two known genes: 
ALOX5AP (Fig. 1b) and a gene with an unknown function called 
highly charged protein (D13S106E). The haplotype association in this 
region for females with myocardial infarction was less significant (P = 
0.0004) than for all individuals with myocardial infarction, and the 
most significant haplotype association was observed for males with 
myocardial infarction (P = 0.000002). The haplotype associated with 
males with myocardial infarction was the only haplotype that retained 
significant association after adjusting for all haplotypes tested. 

FLAP, together with 5-lipoxygenase (5-LO), is a regulator of the 
leukotriene biosynthetic pathway that has recently been implicated in 
the pathogenesis of atherosclerosis 16-18. Therefore, ALOX5AP was a 
good candidate for the gene underlying the association with myocar- 
dial infarction. 

Screening for SNPs in ALOX5AP and LD mapping 

To determine whether variations in ALOX5AP significantly associate 
with myocardial infarction and to search for causal variations, we 
sequenced ALOX5AP in 93 affected individuals and 93 controls. The 
sequenced region covers 60 kb containing ALOX5AP, including the 
five known exons and introns, the 26-kb region 5' to the first exon and 
the 7-kb region 3' to the fifth exon. We identified 144 SNPs, of which 
we excluded 96 from further analysis owing to either a low minor allele 
frequency or complete correlation (redundancy) with other SNPs. 
Figure 1c shows the distribution of the 48 SNPs chosen for genotyp- 
ing, relative to exons, introns and the 5' and 3' flanking regions of 
ALOX5AP. We identified only one SNP in a coding sequence (exon 2), 
which did not lead to an amino acid substitution. The locations of the 
48 SNPs in the National Center for Biotechnology Information human 
genome assembly build 34 are listed in Supplementary Table 3 online. 
In addition to the SNPs, we typed a polymorphism consisting of a 
monopolymer A repeat in the ALOX5AP promoter region 20 . 

The linkage disequilibrium (LD) block structure defined by the 48 
genotyped SNPs is shown in Figure 2. Strong LD was detected across 
the ALOX5AP region, although at least one historical recombination 
seems to have occurred, dividing the region into two strongly corre- 
lated LD blocks. 

Figure 2 Pairwise LD between SNPs in a 60-kb region encompassing 
ALOX5AP. The markers are plotted equidistantly. Two measures of LD are 
shown: D' in the upper left triangle and P values in the lower right triangle. 
Colored lines indicate the positions of the exons of ALOX5AP, and the green 
stars indicate the location of the markers of the at-risk haplotype HapA. 
Scales for both measures of the LD strength are provided on the right. 
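For readers unfamiliar with the D' measure named in the Figure 2 legend, the following sketch shows the standard textbook definition for a pair of biallelic markers. It is an illustration only: the function name `d_prime` and the input frequencies are hypothetical, and the paper estimated haplotype frequencies from genotype data rather than taking them as given.

```python
def d_prime(p_ab: float, p_a: float, p_b: float) -> float:
    """Return |D'| for two biallelic markers.

    p_ab : frequency of the haplotype carrying allele A at marker 1 and allele B at marker 2
    p_a  : frequency of allele A at marker 1
    p_b  : frequency of allele B at marker 2
    """
    d = p_ab - p_a * p_b  # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:
        raise ValueError("D' is undefined when either marker is monomorphic")
    return abs(d) / d_max


if __name__ == "__main__":
    # Hypothetical frequencies, chosen only to show the two extremes.
    print(d_prime(0.50, 0.5, 0.5))  # alleles always travel together
    print(d_prime(0.25, 0.5, 0.5))  # alleles are independent
```

A value of D' near 1 indicates that no historical recombination is evident between the two markers, which is how the block structure described in the text is usually read.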



Haplotype association with myocardial infarction 

In a case-control association study, we genotyped the 48 selected SNPs 
and the monopolymer A repeat marker in a set of 779 unrelated indi- 
viduals with myocardial infarction and 624 population-based con- 
trols. We tested each of the 49 markers individually for association 
with the disease. Three SNPs, one located 3 kb upstream of the first 
exon and the other two 1 kb and 3 kb downstream of the first exon, 
showed nominally significant association to myocardial infarction 
(Supplementary Table 4 online). After adjusting for the number of 
markers tested, however, these results were not significant. We then 
searched for haplotypes associated with the disease using the same 
cohorts. We limited the search to haplotype combinations constructed 
from two, three or four SNPs and tested only haplotypes that were 
over-represented in the individuals with myocardial infarction. The 
resulting P values were adjusted for all the haplotypes we tested by ran- 
domizing the affected individuals and controls. 
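The adjustment by randomization described above can be sketched as a permutation test on the best statistic over all haplotypes tested. The code below is a simplified illustration and not the procedure used in the study: it assumes carrier status is known for each individual (the study inferred haplotypes statistically), it uses a plain case-minus-control carrier-frequency difference as the test statistic, and all names and data are hypothetical.

```python
import random


def max_excess(carriers, labels):
    """Largest case-minus-control carrier-frequency difference over all haplotypes.

    carriers : list of haplotypes; carriers[h][i] is 1 if individual i carries haplotype h
    labels   : labels[i] is 1 for an affected individual and 0 for a control
    """
    n_case = sum(labels)
    n_ctrl = len(labels) - n_case
    best = float("-inf")
    for hap in carriers:
        in_cases = sum(c for c, is_case in zip(hap, labels) if is_case)
        in_ctrls = sum(hap) - in_cases
        best = max(best, in_cases / n_case - in_ctrls / n_ctrl)
    return best


def adjusted_p(carriers, labels, n_perm=1000, seed=0):
    """P value for the best haplotype, adjusted for all haplotypes tested.

    Case/control labels are shuffled n_perm times; the adjusted P value is the
    share of shuffles whose best statistic matches or beats the observed one.
    """
    rng = random.Random(seed)
    observed = max_excess(carriers, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if max_excess(carriers, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical toy data: 40 affected individuals followed by 40 controls.
example_labels = [1] * 40 + [0] * 40
# One haplotype enriched in affected individuals (30 of 40 versus 5 of 40) ...
enriched = [1 if i < 30 else 0 for i in range(40)] + [1 if i < 5 else 0 for i in range(40)]
# ... plus three haplotypes spread evenly across both groups.
background = [[1 if (2 * i + 3 * j) % 5 == 0 else 0 for i in range(80)] for j in range(1, 4)]
example_carriers = [enriched] + background

if __name__ == "__main__":
    print(adjusted_p(example_carriers, example_labels))
```

Because the maximum is taken over every haplotype in each shuffle, the resulting P value accounts for the number of haplotypes examined without assuming that they are independent.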

Several haplotypes were significantly associated with the disease at 
an adjusted significance level of P < 0.05 (Supplementary Table 5 
online). We observed the most significant association with a four-SNP 
haplotype spanning 33 kb, including the first four exons of ALOX5AP 
(Fig. 1c), with a nominal P value of 0.0000023 and an adjusted P value 
of 0.005. This haplotype, called HapA, has a haplotype frequency of 
15.8% (carrier frequency 29.1%) in affected individuals versus 9.5% 
(carrier frequency 18.1%) in controls (Table 1). The relative risk 
conferred by HapA compared with other haplotypes constructed from the 
same SNPs, assuming a multiplicative model, was 1.8 and the 
corresponding PAR was 13.5%. HapA was present at a higher frequency in 
males (carrier frequency 30.9%) than in females with myocardial 
infarction (carrier frequency 25.7%; Table 1). All other haplotypes 
that were significantly associated with an adjusted P value less than 
0.05 were highly correlated with HapA and should be considered variants 
of that haplotype (Supplementary Table 5 online). 
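The relationship among the haplotype frequencies, carrier frequencies, relative risk and PAR quoted above can be checked with a few lines of arithmetic. The sketch below rests on assumptions that are not spelled out in this section: carrier frequency is derived from haplotype frequency as if the two chromosomes of an individual were independent, relative risk under the multiplicative model is taken as the ratio of haplotype odds in affected individuals to that in controls, and PAR uses the control frequency as the population frequency. The function names are hypothetical.

```python
def carrier_frequency(hap_freq: float) -> float:
    """Fraction of individuals carrying at least one copy of the haplotype,
    assuming the two chromosomes are independent draws."""
    return 1.0 - (1.0 - hap_freq) ** 2


def relative_risk(freq_cases: float, freq_controls: float) -> float:
    """Per-copy relative risk under a multiplicative model, computed as the
    ratio of haplotype odds in affected individuals to that in controls."""
    return (freq_cases / (1.0 - freq_cases)) / (freq_controls / (1.0 - freq_controls))


def par(freq_controls: float, rr: float) -> float:
    """Population-attributable risk under a multiplicative model: the share of
    risk removed if every chromosome carried the baseline-risk haplotype."""
    mean_risk = (1.0 + freq_controls * (rr - 1.0)) ** 2
    return 1.0 - 1.0 / mean_risk


if __name__ == "__main__":
    f_cases, f_controls = 0.158, 0.095  # HapA frequencies quoted in the text
    rr = relative_risk(f_cases, f_controls)
    print(f"carrier frequency, affected: {carrier_frequency(f_cases):.3f}")
    print(f"carrier frequency, controls: {carrier_frequency(f_controls):.3f}")
    print(f"relative risk:               {rr:.2f}")
    print(f"PAR:                         {par(f_controls, rr):.3f}")
```

With the rounded frequencies from the text, these formulas give carrier frequencies of about 29.1% and 18.1% and land close to the reported relative risk of 1.8 and PAR of 13.5%; small differences are expected because the inputs here are rounded.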

Association of HapA with stroke and PAOD 

Because of the high degree of comorbidity among myocardial infarc- 
tion, stroke and peripheral arterial occlusive disease (PAOD), with 
most of these cases occurring on the basis of an atherosclerotic disease, 
we wanted to determine whether HapA was also associated with stroke 
or PAOD. We typed the SNPs defining HapA for these cohorts. We 
removed first- and second-degree relatives and all known cases of 
myocardial infarction and tested for association in 702 individuals 
with stroke and 577 individuals with PAOD (Table 1). We observed a 
significant association of HapA with stroke, with a relative risk of 1.67 
(P = 0.000095). In addition, we determined whether HapA was pri- 
marily associated with a particular subphenotype of stroke and found 
that both ischemic and hemorrhagic stroke were significantly associ- 
ated with HapA (Supplementary Table 6 online). Finally, although 
HapA was more frequent in the PAOD cohort than in the population 
controls (Table 1), this was not significant. Similar to the stronger 
association of HapA with males with myocardial infarction than with 
females with myocardial infarction, HapA also showed stronger asso- 
ciation with males than with females with stroke and PAOD (Table 1). 

Haplotype association in a British cohort 

In an independent study, we determined whether variants in 
ALOX5AP also affected the risk of myocardial infarction in a popula- 
tion outside Iceland. We typed SNPs defining HapA in a cohort of 753 
individuals from the UK who had sporadic myocardial infarction and 
in 730 British population controls. The affected individuals and con- 
trols were from three separate study cohorts recruited in Leicester and 
Sheffield. We found a slightly higher frequency of HapA in affected 
individuals versus controls (16.8% versus 15.1%, respectively), but the 
results were not statistically significant. As in the Icelandic population, 
HapA was more common in males with myocardial infarction (carrier 
frequency 31.7%) than in females with myocardial infarction (carrier 
frequency 28.0%). When we typed an additional nine SNPs, distrib- 
uted across ALOX5AP, in the British cohort and searched for other 
haplotypes that might be associated with myocardial infarction, two 
SNPs showed association to myocardial infarction with a nominally 
significant P value (data not shown). Moreover, three- and four-SNP 
haplotype combinations were associated with higher risk of myocar- 
dial infarction in the British cohort, and we observed the most signifi- 



Table 1 Association of HapA with myocardial infarction, stroke and PAOD 

Phenotype (n)                 Frequency   RR     PAR     P value      P value a 
Myocardial infarction (779)   0.158       1.80   0.135   0.0000023    0.005 
  Males (486)                 0.169       1.95   0.158   0.00000091   ND 
  Females (293)               0.138       1.53   0.094   0.0098       ND 
  Early onset (358)           0.139       1.53   0.094   0.0058       ND 
Stroke (702) b                0.149       1.67   0.116   0.000095     ND 
  Males (373)                 0.156       1.76   0.131   0.00018      ND 
  Females (329)               0.141       1.55   0.098   0.0074       ND 
PAOD (577) b                  0.122       1.31   0.056   0.061        ND 
  Males (356)                 0.126       1.36   0.065   0.057        ND 
  Females (221)               0.114       1.22   0.041   0.31         ND 

a P value adjusted for the number of haplotypes tested. b Excluding known cases of myocardial infarction. 
Shown is HapA of ALOX5AP and the corresponding number of affected individuals (n), the haplotype frequency in 
affected individuals, the relative risk (RR), PAR and P values. HapA is defined by the SNPs SG13S25, SG13S114, 
SG13S89 and SG13S32 (Supplementary Table 5 online). The same controls (n = 624) were used for the association 
analysis in myocardial infarction, stroke and PAOD as well as for the analysis of males, females and individuals with 
early onset. The frequency of HapA in the control cohort is 0.095. ND, not done. 



Table 2 Association of HapB with myocardial infarction in British individuals 

Phenotype (n)                 Frequency   RR     PAR     P value   P value a 
Myocardial infarction (753)   0.075       1.95   0.072   0.00037   0.046 
  Males (549)                 0.075       1.97   0.072   0.00093   ND 
  Females (204)               0.073       1.90   0.068   0.021     ND 

a P value adjusted for the number of haplotypes tested using 1,000 randomization tests. 

Shown are the results for HapB, which shows the strongest association in the British myocardial infarction cohort. HapB 
is defined by the SNPs SG13S377, SG13S114, SG13S41 and SG13S35, which have the alleles A, A, A and G, 
respectively. In all three phenotypes shown, the same set of 730 British controls was used and the frequency of HapB 
in the control cohort is 0.040. Number of affected individuals (n), haplotype frequency in affected individuals, 
relative risk (RR) and PAR are indicated. ND, not done. 



cant association for a four-SNP haplotype with a nominal P value of 
0.00037 (Table 2). We call this haplotype HapB. The haplotype fre- 
quency of HapB was 7.5% in the individuals with myocardial infarc- 
tion (carrier frequency 14.4%) compared with 4.0% (carrier 
frequency 7.8%) in controls, conferring a relative risk of 1.95 (Table 
2). This association of HapB remained significant after adjusting for 
all haplotypes tested, using 1,000 randomization steps, with an 
adjusted P = 0.046. No other SNP haplotype had an adjusted P value 
<0.05. The two at-risk haplotypes, HapA and HapB, are mutually 
exclusive; there are no instances in which the same chromosome car- 
ries both haplotypes. 

More LTB4 in individuals with myocardial infarction 

To determine whether individuals with a past history of myocardial 
infarction had greater activity of the 5-LO pathway than controls, we 
measured production of LTB4 (a key product of the 5-LO pathway) 
in blood neutrophils isolated from Icelandic individuals with 
myocardial infarction and controls before and after stimulation with 
the calcium ionophore ionomycin. We detected no difference in 

[Figure 3 panel labels: (a) MI (41), Control (35); (b) Male MI with HapA (10), Male MI without HapA (18), Control (35); time points 15 min and 30 min] 

Figure 3 LTB4 production of ionomycin-stimulated neutrophils from 
individuals with myocardial infarction (n = 41) and controls (n = 35). The log- 
transformed (mean ± s.d.) values measured at 15 and 30 min in stimulated 
cells are shown. (a) LTB4 production in individuals with myocardial infarction 
(MI) and controls. The difference in the mean values between affected 
individuals and controls was tested using a two-sample t-test of the log- 
transformed values. (b) LTB4 production in males with myocardial infarction 
carrying HapA (red bars) and not carrying HapA (white bars). Mean values of 
controls (blue bars) are included for comparison. Males with HapA produced 
the highest amounts of LTB4 (P < 0.005 compared with controls). Data for 
females are shown in Supplementary Table 7 online. 



LTB4 production in resting neutrophils from 
individuals with myocardial infarction ver- 
sus controls. In contrast, LTB4 generation by 
neutrophils stimulated with ionomycin was 
substantially greater in individuals with 
myocardial infarction than in controls after 
15 and 30 min, respectively (Fig. 3a). 
Moreover, the observed difference in release 
of LTB4 was largely accounted for by male 
carriers of HapA (Fig. 3b), whose cells pro- 
duced significantly more LTB4 than cells 
from controls (P = 0.0042; Supplementary 
Table 7 online). There was also a heightened LTB4 response in males 
who did not carry HapA, but this difference was of borderline signif- 
icance (Supplementary Table 7 online). This could be explained by 
additional variants in ALOX5AP that have not been uncovered, or in 
other genes belonging to the 5-LO pathway, that may account for 
upregulation of the LTB4 response in some individuals without the 
ALOX5AP at-risk haplotype. We did not detect differences in LTB4 
response in females (Supplementary Table 7 online), but because of 
the small sample size, this result is not conclusive. The elevated levels 
of LTB4 production in stimulated neutrophils from male carriers of 
the at-risk haplotype suggest that the disease-associated variants of 
ALOX5AP heighten the response of FLAP to factors that stimulate 
inflammatory cells. 

DISCUSSION 

Our results show that variants of ALOX5AP encoding FLAP are asso- 
ciated with greater risk of myocardial infarction and stroke. In our 
Icelandic cohort, a haplotype that spans ALOX5AP is carried by 
29.1% of all individuals with myocardial infarction and almost dou- 
bles the risk of myocardial infarction. We then replicated these find- 
ings in an independent cohort of individuals with stroke. 
Furthermore, stimulated neutrophils from individuals with myocar- 
dial infarction had greater production of LTB4, one of the key prod- 
ucts of the 5-LO pathway. When we examined this in the context of 
the at-risk haplotype, however, the gain of function was largely 
attributed to male carriers of the at-risk haplotype, who also had the 
strongest association with the ALOX5AP haplotype. Another haplo- 
type spanning ALOX5AP was associated with myocardial infarction 
in a British cohort. Although the pathogenic variants responsible for 
the effects associated with the disease haplotypes are unknown, the 
greater production of LTB4 observed in ionomycin-stimulated neu- 
trophils from male carriers of the at-risk haplotype suggests that the 
disease-associated variants increase the response of FLAP to factors 
that stimulate inflammatory cells. 

We observed suggestive linkage to chromosome 13q12-13 with
several different phenotypic groups, including females with myocar- 
dial infarction, individuals of both sexes with early-onset myocardial 
infarction and males with ischemic stroke or TIA. But we observed 
the strongest haplotype association for males with myocardial 
infarction or stroke. Therefore, the linkage signal in females with 
myocardial infarction and in individuals with early-onset myocardial 
infarction is not explained by the at-risk haplotype that we identi- 
fied, and we expect that there may be other unidentified variants or 
haplotypes in ALOX5AP, or in other genes in the linkage region, that 
may confer risk of these cardiovascular phenotypes. These variants 
are probably rarer than HapA with relatively high penetrance, higher 
in women than in men. 

FLAP has an important role in the initial steps of leukotriene 
biosynthesis 15 , which is largely confined to leukocytes and can be 
triggered by a variety of stimuli. In this biosynthetic pathway,
unesterified arachidonic acid is converted to LTA4 by the action of 5-LO
and its activating protein FLAP. The unstable epoxide LTA4 is fur- 
ther metabolized to LTB4 or LTC4 by LTA4 hydrolase and LTC4 syn- 
thase, respectively. In addition, LTA4 can be exported to 
neighboring cells that are devoid of 5-LO activity and become subject
to transcellular leukotriene biosynthesis 21-23 . The leukotrienes
have a variety of proinflammatory effects 24,25 . LTB4 activates leuko- 
cytes, leading to chemotaxis and increased adhesion of leukocytes to 
vascular endothelium, release of lysosomal enzymes such as 
myeloperoxidase and production of superoxide anions 25 . The cys- 
teinyl-containing leukotrienes (LTC4 and its metabolites LTD4 and 
LTE4) increase vascular permeability in postcapillary venules and 
are potent vasoconstrictors of coronary arteries 26-28 .

The importance of the 5-LO pathway is well established in 
asthma, and drugs inhibiting this pathway have been developed for 
treating asthma. The role of the 5-LO pathway in the pathogenesis 
of atherosclerosis has recently received attention. A study of post- 
mortem pathologic specimens showed an increase in the expression 
of members of the 5-LO pathway, including 5-LO and FLAP, in ath- 
erosclerotic lesions at various stages of development in the aorta, 
coronary arteries and carotid arteries 18 . Furthermore, 5-LO was 
localized to macrophages, dendritic cells, foam cells, mast cells and 
neutrophilic granulocytes, and the number of cells expressing 5-LO 
was markedly greater in advanced lesions 18 . The leukocytes positive 
for 5-LO accumulated at distinct sites that are most prone to rup- 
ture 29 , such as the shoulder regions below the fibrous cap of the ath- 
erosclerotic lesion 18 . A 5-LO promoter variant is associated with 
abnormal carotid artery intima-media thickness and heightened 
inflammatory biomarkers 30 . In addition, antagonists of LTB4 block 
the development of atherosclerosis in apo-E-deficient and LDLR-deficient
mice 31 , and a congenic mouse strain with a heterozygous
deficiency of 5-LO shows resistance to atherosclerosis 16 , further 
supporting the idea that greater activity of the 5-LO pathway has a 
role in predisposition to atherosclerosis. 

Our data also show that the at-risk haplotype of ALOX5AP has 
higher frequency in all subgroups of stroke, including ischemic stroke, 
TIA and hemorrhagic stroke. HapA confers significantly higher risk of
myocardial infarction and stroke than it does of PAOD. This could be 
explained by differences in the pathogenesis of these diseases. Unlike 
individuals with PAOD, who have ischemic legs because of atheroscle- 
rotic lesions that are responsible for gradually diminishing blood flow 
to the legs, individuals with myocardial infarction and stroke have suf- 
fered acute events, with disruption of the vessel wall suddenly decreas- 
ing blood flow to regions of the heart and the brain. 

We did not find association between HapA and myocardial infarc- 
tion in a British cohort, but we did find significant association between 
myocardial infarction and a different ALOX5AP variant. The existence 
of different haplotypes of the gene conferring risk to myocardial 
infarction in different populations is not unexpected. It is not unrea- 
sonable to assume that a common disease like myocardial infarction is 
associated with many different mutations or sequence variations and 
that the frequencies of these disease-associated variants may differ 
between populations. It would also not be unexpected for the same 
mutation to arise on different haplotypic backgrounds. 

Our work suggests that ALOX5AP has an important role in the 
pathogenesis of myocardial infarction and stroke in humans. Our 
study, together with others, may provide the necessary background to 
launch therapeutic trials to determine whether pharmacological inhi- 
bition of FLAP will prevent the development of myocardial infarction 
and stroke. 



METHODS 

Study population. We recruited the individuals in the study from a registry of 
over 8,000 individuals, which includes all individuals who had myocardial 
infarctions before the age of 75 in Iceland from 1981 to 2000. This registry is a 
part of the WHO MONICA Project 19 . Diagnoses of all individuals in the
registry follow strict diagnostic rules based on signs, symptoms,
electrocardiograms, cardiac enzymes and necropsy findings.

We used genotypes from 713 individuals with myocardial infarction and 
1,741 of their first-degree relatives in the linkage analysis. For the microsatellite
association study of the locus associated with myocardial infarction, we used 
802 unrelated (no first- or second-degree relatives) individuals with myocardial
infarction (233 females, 624 males and 302 with early onset) and 837 popula- 
tion-based controls. The females studied were post-menopausal. Over 90% of 
the individuals were taking aspirin or other nonsteroidal anti-inflammatory 
drugs. For the SNP association study in and around ALOX5AP, we genotyped
779 unrelated individuals with myocardial infarction (293 females, 486 males 
and 358 with early onset). The control group for the SNP association study was 
population-based and comprised of 624 unrelated males and females 20-90 
years of age whose medical history was unknown. The stroke and PAOD 
cohorts used in this study have previously been described 32-34 . For the stroke 
linkage analysis, we used genotypes from 342 males with ischemic stroke or TIA 
that were linked to at least one other male within and including six meioses in 
164 families. For the association studies, we analyzed 702 individuals with all 
forms of stroke (329 females and 373 males) and 577 individuals with PAOD 
(221 females and 356 males). Individuals with stroke or PAOD who also had 
myocardial infarction were excluded. Controls used for the stroke and PAOD 
association studies were the same as used in the myocardial infarction SNP 
association study. 

The study was approved by the Data Protection Commission of Iceland and 
the National Bioethics Committee of Iceland. We obtained informed consent 
from all study participants. Personal identifiers associated with medical infor- 
mation and blood samples were encrypted with a third-party encryption sys- 
tem as previously described 35 . 

Statistical analysis. We carried out a genome-wide scan as previously 
described 33 , using a set of 1,068 microsatellite markers. We used multipoint, 
affected-only allele-sharing methods 36 to assess the evidence for linkage. All 
results were obtained using the program Allegro 37 and the deCODE genetic 
map 38 . We used the Spairs scoring function 39,40 and the exponential
allele-sharing model 36 to generate the relevant 1-degree-of-freedom statistics. When
combining the family scores to obtain an overall score, we used a weighting 
scheme that is halfway on a log scale between weighting each affected pair 
equally and weighting each family equally. In the analysis, all genotyped
individuals who were not affected were treated as 'unknown'. Because of concern
with small-sample behavior, we usually computed corresponding P values in 
two different ways for comparison and report the less significant one. The first 
P value was computed based on large sample theory, $Z_{lr} = \sqrt{2\,\log_e(10)\,\mathrm{lod}}$, and
is distributed approximately as a standard normal distribution under the null 
hypothesis of no linkage 36 . A second P value was computed by comparing the 
observed lod score with its complete data sampling distribution under the null 
hypothesis 37 . When a data set consisted of more than a handful of families, 
these two P values tended to be very similar. The information measure we used, 
which is implemented in Allegro, is closely related to a classical measure of 
information and has a property that is between 0 (if the marker genotypes are 
completely uninformative) and 1 (if the genotypes determine the exact amount 
of allele sharing by descent among the affected relatives) 41,42 .
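
The large-sample conversion from a lod score to a P value described above can be sketched as follows. This is only an illustration of the stated formula, not the Allegro implementation; the function names are hypothetical, and the use of a one-sided (upper-tail) normal probability is an assumption.

```python
import math


def lod_to_z(lod: float) -> float:
    """Z_lr = sqrt(2 * ln(10) * lod), the large-sample statistic quoted in the text."""
    if lod < 0:
        raise ValueError("lod must be non-negative")
    return math.sqrt(2.0 * math.log(10.0) * lod)


def one_sided_p(z: float) -> float:
    """Upper-tail probability of a standard normal variate (assumed one-sided)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


if __name__ == "__main__":
    for lod in (1.0, 2.0, 3.0):
        z = lod_to_z(lod)
        print(f"lod={lod:.1f}  Z={z:.3f}  P={one_sided_p(z):.2e}")
```

Under these assumptions a lod score of 3 corresponds to a Z value of about 3.7 and a P value on the order of 1 in 10,000.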

For single-marker association studies, we used Fisher's exact test to calculate 
two-sided P values for each allele. All P values are unadjusted for multiple
comparisons unless specifically indicated. We present allelic rather than carrier
frequencies for microsatellites, SNPs and haplotypes. To minimize any bias due to
the relatedness of the individuals who were recruited as families for the linkage 
analysis, we eliminated first- and second-degree relatives. For the haplotype 
analysis we used the program NEMO 32 , which handles missing genotypes and 
uncertainty with phase through a likelihood procedure, using the expectation- 
maximization algorithm as a computational tool to estimate haplotype fre- 
quencies. Under the null hypothesis, the affected individuals and controls were 
assumed to have identical haplotype frequencies. Under the alternative 
hypotheses, the candidate at-risk haplotype was allowed to have a higher
frequency in the affected individuals than in controls, and the ratios of frequencies
of all other haplotypes were assumed to be the same in both groups. 
Likelihoods were maximized separately under both hypotheses, and a
corresponding 1-degree-of-freedom likelihood ratio statistic was used to evaluate
statistical significance 32 . Although we only searched for haplotypes that 
increased the risk, all reported P values are two-sided unless otherwise stated. 
To assess the significance of the haplotype association corrected for multiple 
testing, we carried out a randomization test using the same genotype data. We 
randomized the cohorts of affected individuals and controls and repeated the 
analysis. This procedure was repeated up to 1,000 times, and the P value we
present is the fraction of replications that produced a P value for a haplotype tested
that was lower than or equal to the P value we observed using the original 
affected individual and control cohorts. 
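
The randomization procedure described above can be sketched in a few lines. This is a simplified illustration, not the NEMO analysis: the likelihood-ratio P value is replaced by a stand-in statistic (the largest case-minus-control haplotype frequency excess over all candidates), and the data layout and function names are hypothetical.

```python
import random


def allele_freq(copies):
    """Allelic frequency from per-individual copy counts (0, 1 or 2)."""
    return sum(copies) / (2 * len(copies))


def best_excess(cases, controls):
    """Largest case-minus-control frequency excess over all candidate haplotypes.

    cases / controls: lists of tuples, one copy count per candidate haplotype.
    This stands in for "the most significant P value among haplotypes tested".
    """
    n_hap = len(cases[0])
    return max(
        allele_freq([ind[h] for ind in cases])
        - allele_freq([ind[h] for ind in controls])
        for h in range(n_hap)
    )


def randomization_p(cases, controls, n_rep=1000, seed=0):
    """Fraction of label shuffles whose best statistic is at least as extreme."""
    rng = random.Random(seed)
    observed = best_excess(cases, controls)
    pooled = list(cases) + list(controls)
    hits = 0
    for _ in range(n_rep):
        rng.shuffle(pooled)  # randomize affected/control labels
        perm_cases = pooled[: len(cases)]
        perm_controls = pooled[len(cases):]
        if best_excess(perm_cases, perm_controls) >= observed:
            hits += 1
    return hits / n_rep
```

Because the best statistic is taken over all candidates within each replicate, the resulting fraction accounts for the number of haplotypes searched, which is the point of the correction.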

For both single-marker and haplotype analysis, we calculated relative risk 
(RR) and PAR assuming a multiplicative model 43,44 in which the risk of the two
alleles of haplotypes a person carries multiply. We calculated LD between pairs 
of SNPs using the standard definition of D′ (ref. 45) and R² (ref. 46). Using
NEMO, we estimated frequencies of the two marker allele combinations by 
maximum likelihood and evaluated deviation from linkage equilibrium by a 
likelihood ratio test. When plotting all SNP combinations to elucidate the LD 
structure in a particular region, we plotted D′ in the upper left corner and the P
value in the lower right corner. In the LD plots we present, the markers are plot- 
ted equidistantly rather than according to their physical positions. 
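
For reference, the two LD measures named above can be computed from haplotype and allele frequencies as in the following sketch. It uses the standard textbook definitions and takes the frequencies as given (in the study they were estimated by maximum likelihood); the function name and argument layout are hypothetical.

```python
def ld_measures(p_ab: float, p_a: float, p_b: float):
    """Return (D, |D'|, r^2) for two biallelic markers.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a, p_b: frequencies of allele A and allele B
    """
    d = p_ab - p_a * p_b
    # D' scales D by the largest value it could take given the allele frequencies
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2
```

The two measures answer different questions: with p_ab = 0.2, p_a = 0.2 and p_b = 0.5, D′ equals 1 (no evidence of historical recombination) while r² is only 0.25, because the allele frequencies differ.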

Identification of DNA polymorphisms. We identified new polymorphic 
repeats (dinucleotide or trinucleotide repeats) with the Sputnik program. We 
subtracted the lower allele of the CEPH sample 1347-02 (CEPH genomics 
repository) from the alleles of the microsatellites and used it as a reference. We 
detected SNPs in the gene by PCR sequencing exonic and intronic regions from 
affected individuals and controls. We also detected public polymorphisms by 
BLAST search of the National Center for Biotechnology Information SNP data- 
base. We genotyped SNPs using a method for detecting SNPs with fluorescent 
polarization template-directed dye-terminator incorporation 47 and TaqMan
assays (Applied Biosystems). 

Isolation and activation of peripheral blood neutrophils. We drew 50 ml of 
blood from each of 41 individuals with myocardial infarction and 35 age- and 
sex-matched controls into vacutainers containing EDTA. All blood was drawn 
at the same time in the early morning after 12 h of fasting. We isolated
neutrophils using Ficoll-Paque PLUS (Amersham Biosciences).

We collected the red cell pellets from the Ficoll gradient and then lysed red 
blood cells in 0.165 M ammonium chloride for 10 min on ice. After washing 
them with phosphate-buffered saline, we counted neutrophils and plated them 
at 2 × 10^6 cells ml^-1 in 4-ml cultures of 15% fetal calf serum (GIBCO BRL) in
RPMI-1640 medium (GIBCO BRL). We then stimulated cells with the maximum
effective concentration of ionomycin (1 µM). At 0, 15, 30 and 60 min after adding
ionomycin, we aspirated 600 µl of culture medium and stored it at -80 °C for
the measurement of LTB4 release as described below. We maintained cells at 
37 °C in a humidified atmosphere of 5% carbon dioxide-95% air. We treated all 
samples with indomethacin (1 µM) to block the cyclooxygenase enzyme.

Ionomycin-induced release of LTB4 in neutrophils. We used the LTB4 
Immunoassay (R&D systems) to quantify LTB4 concentration in supernatant 
from cultured ionomycin-stimulated neutrophils. The assay we used is based on 
the competitive binding technique in which LTB4 present in the testing samples 
(200 µl) competes with a fixed amount of alkaline phosphatase-labeled LTB4 for
sites on a rabbit polyclonal antibody. During the incubation, the polyclonal anti- 
body becomes bound to a goat antibody to rabbit coated onto the microplates. 
After washing to remove excess conjugate and unbound sample, a substrate solu- 
tion was added to the wells to determine the bound enzyme activity. We stopped 
the color development and read the absorbance at 405 nm. The intensity of the 
color is inversely proportional to the concentration of LTB4 in the sample. Each 
LTB4 measurement using the LTB4 Immunoassay was done in duplicate.

British study population. We recruited three separate British cohorts as 
described previously 48,49 . The first two cohorts comprised 549 individuals from
among those who were admitted to the coronary care units of the Leicester
Royal Infirmary, Leicester (July 1993-April 1994), and the Royal Hallamshire 
Hospital, Sheffield (November 1995-March 1997), and satisfied the WHO cri- 
teria for acute myocardial infarction in terms of symptoms, elevations in car- 
diac enzymes or electrocardiographic changes 50 . We recruited 532 control 
individuals in each hospital from adult visitors of individuals with noncardio- 
vascular disease on general medical, surgical, orthopedic and obstetric wards to 
find subjects representative of the source population from which the affected 
individuals originated. Individuals who reported a history of coronary heart 
disease were excluded. 

In the third cohort, we recruited 204 individuals retrospectively from the 
registries of three coronary care units in Leicester. All had suffered a myocardial 
infarction according to WHO criteria before the age of 50 years. At the time of 
participation, individuals were at least 3 months from the acute event. The
control cohort comprised 198 individuals with no personal or family history of
premature coronary heart disease, matched for age, sex and current smoking 
status with the cases. We recruited control individuals from three primary care 
practices located in the same geographical area. In all cohorts, individuals were 
white of Northern European origin. Local research ethics committees approved 
all the studies, and individuals provided written informed consent for use of 
samples in genetic studies of coronary artery disease. 

URLs. The Sputnik program is available at http://espressosoftware.com/pages/ 
sputnik.jsp. The National Center for Biotechnology Information SNP database 
is available at http://www.ncbi.nlm.nih.gov/SNP/index.html. 

Note: Supplementary information is available on the Nature Genetics website. 
Association between the Gene Encoding 5-Lipoxygenase-Activating Protein
and Stroke Replicated in a Scottish Population 

A. Helgadottir, 1 S. Gretarsdottir, 1 D. St. Clair, 2 A. Manolescu, 1 J. Cheung, 2 G. Thorleifsson, 1
A. Pasdar, 2 S. F. A. Grant, 1 L. J. Whalley, 2 H. Hakonarson, 1 U. Thorsteinsdottir, 1 A. Kong, 1
J. Gulcher, 1 K. Stefansson, 1 and M. J. MacLeod 2

1 deCODE Genetics, Reykjavik; and 2 Aberdeen Royal Infirmary and University of Aberdeen Medical School, Aberdeen, Scotland

Cardiovascular diseases, including myocardial infarction (MI) and stroke, most often occur on the background of 
atherosclerosis, a condition attributed to the interactions between multiple genetic and environmental risk factors. 
We recently reported a linkage and association study of MI and stroke that yielded a genetic variant, HapA, in 
the gene encoding 5-lipoxygenase-activating protein (ALOX5AP), that associates with both diseases in Iceland. We 
also described another ALOX5AP variant, HapB, that associates with MI in England. To further assess the
contribution of the ALOX5AP variants to cardiovascular diseases in a population outside Iceland, we genotyped seven
single-nucleotide polymorphisms that define both HapA and HapB from 450 patients with ischemic stroke and 
710 controls from Aberdeenshire, Scotland. The Icelandic at-risk haplotype, HapA, had significantly greater fre- 
quency in Scottish patients than in controls. The carrier frequency in patients and controls was 33.4% and 26.4%, 
respectively, which resulted in a relative risk of 1.36, under the assumption of a multiplicative model (P = .007). 
We did not detect association between HapB and ischemic stroke in the Scottish cohort. However, we observed 
that HapB was overrepresented in male patients. This replication of haplotype association with stroke in a population
outside Iceland further supports a role for ALOX5AP in cardiovascular diseases.



Cardiovascular diseases (CVDs), such as coronary heart 
disease and stroke, are major causes of death and dis- 
ability in western societies (Aboderin et al. 2002). As a 
result of the increasing age of the population, the preva- 
lence of CVD is rising worldwide (American Heart As- 
sociation 2002). CVDs are largely attributed to athero- 
sclerosis, which has various environmental and genetic 
risk factors. It is a commonly held view that chronic in- 
flammation initiates and promotes the development of 
atherosclerotic lesions (Lusis 2000; Libby 2002). Large 
epidemiologic studies have demonstrated correlations be- 
tween increased production of markers of systemic in- 
flammation and future cardiovascular events, including 
myocardial infarction (MI) (Ridker et al. 1997, 1998; 
Danesh et al. 2000) and stroke (Di Napoli et al. 2001),
which supports a central role for inflammation in CVD. 

We recently published the association of a variant in 
the gene encoding 5-lipoxygenase-activating protein 
(ALOX5AP [MIM 603700]) with both MI and stroke
in an Icelandic population (Helgadottir et al. 2004).
ALOX5AP, which encodes an important component of
the leukotriene pathway, was identified through a
genomewide linkage scan conducted on 296 families with
MI and subsequent analysis that determined association
with markers within the mapped region on chromosome
13q12-13. A haplotype spanning ALOX5AP, HapA,
defined by four SNPs, was shown to be associated with MI
(relative risk = 1.8; P = .0000023) and, subsequently, 
the same variant was found to confer risk of stroke in 
Iceland (relative risk [RR] = 1.7; P = .000095) (Helga- 
dottir et al. 2004). Another SNP-based haplotype within 
ALOX5AP, HapB, showed significant association with
MI in British cohorts from Leicester and Sheffield 
(RR = 2.0; P = .00037) (Helgadottir et al. 2004). We 
further demonstrated that leukotriene B4 (LTB4) syn- 
thesis by neutrophils from patients with a history of MI 
is greater than the synthesis by those from controls
without MI (Helgadottir et al. 2004).

In the present study, we attempted to replicate the 
association of ALOX5AP with stroke in a population
outside Iceland. The SNPs defining HapA (SG13S25,
SG13S114, SG13S89, and SG13S32) and HapB
(SG13S377, SG13S114, SG13S41, and SG13S35) were
genotyped for 450 Scottish patients who had experienced 
a stroke and for 710 controls. The patient and control 
cohorts have been described elsewhere (MacLeod et al. 
1999; Meiklejohn et al. 2001; Duthie et al. 2002; Whal- 
ley et al. 2004). In brief, 450 patients from northeastern 
Scotland with CT confirmation of ischemic stroke (in- 
cluding 26 patients with transient ischemic attack [TIA]) 
were recruited between 1997 and 1999, within 1 wk of 
admission to the Acute Stroke Unit at Aberdeen Royal 
Infirmary. Patients were further subclassified in accordance
with the TOAST (Trial of Org 10172 in Acute
Stroke Treatment) research criteria (Adams et al. 1993). 
Of the patients, 155 (34.4%) had large-vessel stroke, 96 
(21.3%) had cardiogenic stroke, and 109 (24.2%) had
small-vessel stroke; for 5 (1.1%) of the patients, stroke 
with other determined etiology was diagnosed, 7 (1.6%) 
had more than one etiology, and 78 (17.3%) had un- 
known cause of stroke despite extensive evaluation. A 
total of 710 control individuals with no history of stroke 
or TIA were recruited during follow-up of the 1921 
(n = 227) and 1936 (n = 371) Aberdeen Birth Cohort 
Studies originally recruited in 1932 and 1947, respec- 
tively, as part of the Scottish mental surveys (Deary et 
al. 2004). A further 112 controls were recruited from 
local primary-care practices (Meiklejohn et al. 2001). 
Basic clinical characteristics of patients and control in- 
dividuals are shown in table 1. Approval for the study 
was granted by the local research ethics committee, and 
all study participants gave written informed consent. 

The haplotype analysis was performed using the pro- 
gram NEMO (Gretarsdottir et al. 2003). NEMO handles 
missing genotypes and uncertainty with phase through a 
likelihood procedure, by use of the expectation-maxi- 
mization algorithm as a computational tool to estimate 
haplotype frequencies. Since we were testing only two 
haplotypes, which had been shown elsewhere to confer 
risk of MI and stroke in an Icelandic cohort and MI in 
an English cohort, the reported P values are one sided. 
For the at-risk haplotypes, we calculated RR and popu- 
lation-attributable risk (PAR) under the assumption of 
a multiplicative model (Falk and Rubinstein 1987; Ter- 
williger and Ott 1992) in which the risk of the two alleles 
of haplotypes a person carries multiplies. 
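
To illustrate how an expectation-maximization procedure can estimate haplotype frequencies when phase is uncertain, the following is a minimal two-SNP sketch. It is not NEMO: it assumes complete genotypes, Hardy-Weinberg proportions and only two biallelic markers, and the function name and genotype coding are hypothetical.

```python
from collections import Counter


def em_haplotype_freqs(genotypes, n_iter=200, tol=1e-10):
    """Estimate frequencies of haplotypes 00, 01, 10, 11 from unphased genotypes.

    genotypes: list of (g1, g2), the number of copies (0, 1 or 2) of allele 1
    at SNP 1 and SNP 2. Only double heterozygotes (1, 1) are phase-ambiguous.
    """
    counts = Counter(genotypes)
    n_chrom = 2 * len(genotypes)
    freqs = [0.25, 0.25, 0.25, 0.25]
    for _ in range(n_iter):
        expected = [0.0] * 4
        for (g1, g2), n in counts.items():
            if g1 == 1 and g2 == 1:
                # E-step: split double heterozygotes between the two phasings
                cis = freqs[0] * freqs[3]    # haplotype pair 00 / 11
                trans = freqs[1] * freqs[2]  # haplotype pair 01 / 10
                total = cis + trans
                w = cis / total if total > 0 else 0.5
                expected[0] += n * w
                expected[3] += n * w
                expected[1] += n * (1 - w)
                expected[2] += n * (1 - w)
            else:
                # Phase is unambiguous: read the two haplotypes off directly
                a = (g1 // 2, (g1 + 1) // 2)
                b = (g2 // 2, (g2 + 1) // 2)
                for ai, bi in zip(a, b):
                    expected[2 * ai + bi] += n
        # M-step: new frequencies are expected counts over chromosomes
        new_freqs = [e / n_chrom for e in expected]
        converged = max(abs(x - y) for x, y in zip(new_freqs, freqs)) < tol
        freqs = new_freqs
        if converged:
            break
    return freqs
```

A likelihood-based program of the kind described in the text would additionally handle missing genotypes and more markers, and would compare the maximized likelihoods under the null and alternative hypotheses.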

Table 1

Clinical Characteristics of Scottish Patients and Control Individuals

Characteristics                  Patients (n = 450)   Controls (n = 710)
Female:male                      42:58                49:51
Age (years)                      66.8 ± .6            67.2 ± .4
Hypertension (%)                 55.5                 23.9
Diabetes (%)                     12.6                 2.1
Total cholesterol (mmol/liter)   5.65 ± .06           5.64 ± .05

NOTE. Patients and control individuals were classified
as having hypertension and/or diabetes on the basis of
previous history or receipt of antihypertensive or
antidiabetic therapy. Values with plus-minus symbol ( ± ) are
mean ± SE.

The results of the haplotype-association analysis for
HapA and HapB are shown in table 2. The haplotype
frequencies of HapA in the Scottish populations (patient
and control) were higher than in the corresponding
Icelandic populations (table 2). As demonstrated in the
Icelandic population, the estimated frequency of HapA was
significantly greater in Scottish patients who have suf- 
fered a stroke than in Scottish controls. The carrier fre- 
quency of HapA in Scottish patients and controls was 
33.4% and 26.4%, respectively, which resulted in an RR 
of 1.36 (P = .007) and a corresponding PAR of 9.6%. 
We had previously observed in the Icelandic population 
a higher frequency of HapA in male than in female pa- 
tients with either stroke or MI (Helgadottir et al. 2004). 
This sex difference in the frequency of HapA was not 
observed in the Scottish population (table 2). 
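
The relationship between the allelic frequencies in table 2 and the carrier frequencies, RR and PAR quoted above can be checked with a short calculation. This sketch assumes Hardy-Weinberg proportions, a multiplicative model in which the RR equals the allelic odds ratio, and the control frequency as a proxy for the population frequency; these are assumptions made for illustration, and the function names are hypothetical.

```python
def carrier_frequency(f: float) -> float:
    """Fraction carrying at least one copy, assuming Hardy-Weinberg proportions."""
    return 1.0 - (1.0 - f) ** 2


def relative_risk(f_cases: float, f_controls: float) -> float:
    """Per-copy RR under the multiplicative model (the allelic odds ratio)."""
    return (f_cases / (1.0 - f_cases)) / (f_controls / (1.0 - f_controls))


def population_attributable_risk(f_controls: float, rr: float) -> float:
    """PAR when the risks of the two copies multiply."""
    mean_risk = (f_controls * rr + (1.0 - f_controls)) ** 2
    return 1.0 - 1.0 / mean_risk


if __name__ == "__main__":
    f_patients, f_controls = 0.184, 0.142  # HapA, Scotland (table 2)
    rr = relative_risk(f_patients, f_controls)
    print(f"carriers: patients {carrier_frequency(f_patients):.3f}, "
          f"controls {carrier_frequency(f_controls):.3f}")
    print(f"RR {rr:.2f}, PAR {population_attributable_risk(f_controls, rr):.3f}")
```

With the Scottish HapA frequencies, these formulas give values close to the reported carrier frequencies of 33.4% and 26.4%, the RR of 1.36 and the PAR of 9.6%.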

We then tested the association of HapB with stroke 
in the Scottish cohort. HapB has been shown elsewhere 
to confer risk of MI in an English cohort (Helgadottir 
et al. 2004). A slight excess of HapB was observed in 
the patient group (6.8%) compared with controls (5.8%), 
but it was not significant (table 2). However, sex-specific 
analysis showed that the frequency of HapB was higher 
in males with ischemic stroke (9.2%) than in controls, 
resulting in an RR of 1.65 (P = .016). The frequency of 
HapB in females with ischemic stroke was 3.5%, which 
was lower but not significantly different from that of 
controls. The frequencies of HapB in males and females 
with ischemic stroke differed significantly (P = .0021). 
Interestingly, as shown in table 2, similar trends were 
observed in our Icelandic cohort; the frequency of HapB 
was greater in males with ischemic stroke (8.6%) than 
in females with ischemic stroke (5.8%), although this 
was not significant (P = .055). 

To summarize our results, we demonstrate in the present
study that HapA, the risk haplotype of ALOX5AP,
reported elsewhere to confer risk of MI and stroke in 
an Icelandic cohort, associates with ischemic stroke in 
a Scottish cohort. HapB, which confers risk of MI in an 
English cohort, was not associated with ischemic stroke 
in the Scottish cohort. However, we observed that HapB 
was overrepresented in male patients. 

Table 2

Analysis of Association of HapA and HapB with Ischemic Stroke

                                                    HapA                        HapB
Location and Study Population (n)          Frequency   RR     P        Frequency   RR     P
Scotland:
  Controls (710)                           .142                        .058
  Patients with ischemic stroke (450 a):   .184        1.36   .007     .068        1.20   NS
    Males (253)                            .183        1.35   .023     .092        1.65   .016
    Females (181)                          .179        1.34   .044     .035        .58    NS
Iceland:
  Controls (624)                           .095                        .067
  Patients with ischemic stroke (632):     .147        1.63   .00013   .073        1.09   NS
    Males (335)                            .155        1.75   .0002    .086        1.31   NS
    Females (297)                          .138        1.51   .0079    .058        .86    NS

NOTE. Shown are HapA and HapB of ALOX5AP and the corresponding number of individuals
genotyped, the haplotype frequency in the patient and control cohorts, the RR, and the one-sided P
values. HapA is defined by the SNPs SG13S25, SG13S114, SG13S89, and SG13S32, with alleles G,
T, G, and A, respectively, and HapB is defined by the SNPs SG13S377, SG13S114, SG13S41, and
SG13S35, with alleles A, A, A, and G, respectively. For SNP genotyping, we used TaqMan assays
(Applied Biosystems) or the fluorescent-polarization template-directed dye-terminator incorporation
(the SNP-FP-TDI assay), as described elsewhere (Chen et al. 1999). SNP information can be found in
the dbSNP database. The DNA used for the SNP genotyping was the product of whole-genome
amplification, by use of the GenomiPhi Amplification kit (Amersham), of DNA isolated from the
peripheral blood of the Scottish controls and patients with stroke. Data on the Icelandic cohort have
been reported elsewhere (Helgadottir et al. 2004). NS = not significant.

a Sex unknown for 16 patients.

Historical and archaeological data have suggested a
Gaelic ancestry for both Icelanders and Scots. This is
further supported by recent studies of mtDNA and
Y-chromosome diallelic and microsatellite variation in
Icelanders, Scandinavians, and Gaels from Ireland and
Scotland (Helgason et al. 2000, 2001). Given this common
ancestry, it is possible that the two populations share a 
disease-causing variant and that this variant may reside 
on the same common haplotype background (HapA). 
Such a scenario would be consistent with our results; 
although the estimated RR for HapA in the Scottish 
cohort is somewhat lower than in the Icelandic cohort, 
this difference is not statistically significant. Indeed, a 
similar observation has been made in previous studies 
of schizophrenia in Iceland and Scotland (Stefansson et 
al. 2003), in which the same extended haplotype was 
found to confer risk of schizophrenia in both popula- 
tions, with comparable frequencies in patient and con- 
trol groups in the two countries. 

The gene ALOX5AP encodes the membrane-associated
5-lipoxygenase-activating protein (FLAP), an important
mediator of the activity of cellular 5-lipoxygenase
(5-LO), which is a key enzyme in the biosynthesis of 
leukotrienes (Dixon et al. 1990; Miller et al. 1990). Leu- 
kotrienes are proinflammatory mediators produced pre- 
dominantly in inflammatory cells such as polymorpho- 
nuclear leukocytes, macrophages, and mast cells. Over 
the last decade, a number of studies have supported an 
important role for inflammation in atherosclerosis — from 
atheroma initiation to promotion of plaque rupture, 
thereby triggering thrombosis, the main atherosclerotic 
complication that causes MI and stroke (Libby 2002). 



The 5-LO pathway could be an important contributor 
to the pathophysiology of atherosclerosis through the 
formation of the proinflammatory LTB4 and/or through 
an increase in vascular permeability caused by cysteinyl 
leukotrienes. Indeed, we have shown increased produc- 
tion of LTB4 in neutrophils from patients with history 
of MI, compared with controls without history of MI 
(Helgadottir et al. 2004). This is further supported by 
recent human-expression studies (Spanbroek et al. 2003)
that show an increased expression of members of the 5- 
LO pathway, including 5-LO and FLAP, in atheroscle- 
rotic lesions at various stages of their development. 
Moreover, a promoter variant of 5-LO (ALOX5 [MIM
152390]) has been shown to be associated with increased 
carotid artery intima-media thickness and with height- 
ened inflammatory biomarkers (Dwyer et al. 2004). In 
addition, an atherosclerotic mouse model with a hetero- 
zygous deficiency of 5-LO shows resistance to athero- 
sclerosis (Mehrabian et al. 2002), and an LTB4 receptor 
antagonist blocks the development of atherosclerosis in 
apoE- and LDLR-deficient mice (Aiello et al. 2002; 
Mehrabian et al. 2002). Together, these studies suggest 
that chronic upregulation of the leukotriene pathway 
may be harmful to the vasculature, in terms of athero- 
sclerosis progression and plaque instability. 

The precise mechanism by which the ALOX5AP variants 
confer risk of MI and stroke is still unclear. As 
reported elsewhere, we have not observed SNPs in the 
coding sequence that led to amino acid substitution 
(Helgadottir et al. 2004). Therefore, one can speculate that 
unidentified variation in regulatory regions of the gene 
that affects transcription, splicing, message stability, 
message transport, or translation efficiency may underlie 
the risk conferred by ALOX5AP. 

The results of the present study show that HapA as- 
sociates with ischemic stroke in a Scottish population, 
thereby providing replication of work that showed that 
the same haplotype confers increased risk of stroke in 
an Icelandic population. This replication constitutes ad- 
ditional evidence for the role of ALOX5AP in the patho- 
genesis of stroke. Identification of genetic risk factors for 
the common forms of stroke may facilitate identification 
of individuals at increased risk and may lead to novel 
strategies for the prevention and treatment of stroke. 

ALOX5AP gene variants and risk of coronary artery 
disease: an angiography-based study 

Domenico Girelli* 1, Nicola Martinelli 1, Elisabetta Trabetti 2, Oliviero Olivieri 1, 
Ugo Cavallari 2, Giovanni Malerba 2, Fabiana Busti 1, Simonetta Friso 1, Francesca Pizzolo, 
Pier Franco Pignatti 2 and Roberto Corrocher 1 

1 Department of Clinical and Experimental Medicine, University of Verona, Verona, Italy; 2 Department of Mother and 
Child and Biology-Genetics, University of Verona, Verona, Italy 

The aim of this study was to explore the role of variants of the gene encoding arachidonate 
5-lipoxygenase-activating protein (ALOX5AP) as possible susceptibility factors for coronary artery disease 
(CAD) and myocardial infarction (MI) in patients with or without angiographically proven CAD. A total of 
1431 patients with or without angiographically documented CAD were examined simultaneously for seven 
ALOX5AP single-nucleotide polymorphisms, allowing reconstruction of the at-risk haplotypes (HapA and 
HapB) previously identified in the Icelandic and British populations. Using a haplotype-based approach, 
HapA was not associated with either CAD or MI. On the other hand, HapB and another haplotype within 
the same region (that we named HapC) were significantly more represented in CAD versus CAD-free 
patients, and these associations remained significant after adjustment for traditional cardiovascular risk 
factors by logistic regression (HapB: odds ratio (OR) 1.67, 95% confidence interval (CI) 1.04-2.67; 
P = 0.032; HapC: OR 2.41, 95% CI 1.09-5.32; P = 0.030). No difference in haplotype distributions was 
observed between CAD subjects with or without a previously documented MI. Our angiography-based 
study suggests a possible modest role of ALOX5AP in the development of the atheroma rather than in its 
late thrombotic complications such as MI. 

Introduction 

Interest in unraveling the genetic basis of coronary artery 
disease (CAD) has recently been renewed by results 
obtained with powerful approaches such as genome-wide 
scan studies. 1-3 In contrast to classic association 
studies involving single-nucleotide polymorphisms (SNPs) 
in candidate genes, genome-wide scan studies have the 
advantage of discovering new gene(s) without an a priori 
hypothesis. A paradigm of the successful use of such 
strategies was the identification of arachidonate 
5-lipoxygenase-activating protein (ALOX5AP) as a susceptibility 
gene for myocardial infarction (MI) and stroke. 4 Interestingly, 
ALOX5AP encodes the 5-lipoxygenase-activating 
protein (FLAP), which is an essential regulator of the 
biosynthesis of leukotriene A4 (LTA4). 5,6 Indeed, the 
5-lipoxygenase (5-LO)/leukotriene pathway has been 
independently implicated in the pathogenesis of atherosclerosis 
in humans 7,8 and mice 9 (reviewed by Zhao and Funk 10 ). 
While not successful in discovering causal variants in 
ALOX5AP, the original study by Helgadottir et al 4 identified 
a 4-SNP haplotype, named HapA, as a risk factor for MI 
and stroke in the Icelandic population. The authors were 
unable to confirm the result in a cohort of British patients 
with MI; however, in that cohort they reported an 
association of another 4-SNP haplotype, named HapB, 
with MI risk. 4 The few subsequent studies on ALOX5AP in 
different populations yielded conflicting results. 
Lohmussaar et al 11 studied Central European patients with stroke, 
finding a significant association for several ALOX5AP SNPs, 
including one that was part of HapA. On the other hand, 
studies in North Americans failed to show a significant 
association with either stroke or MI. 12,13 

To date, no genetic-epidemiological data are available 
for populations from Southern Europe. Moreover, none of 
the previous studies specifically attempted to dissect the 
role of ALOX5AP in the atherosclerosis phenotype rather 
than in its 'complication' phenotype (MI). We therefore 
evaluated simultaneously seven ALOX5AP SNPs and their 
reconstructed haplotypes as possible risk determinants for 
CAD and MI within the framework of an Italian population 
with or without angiographically confirmed CAD. 



Materials and methods 
Study population 

The Verona Heart Project is an ongoing study aimed to 
identify new risk factors for CAD and MI in a population of 
subjects with angiographic documentation of their cor- 
onary vessels. Details about the enrolment criteria have 
been described elsewhere. 14 In the present study, we 
examined data from a total of 1431 subjects, for whom 
complete analyses of seven ALOX5AP SNPs (see below) 
were available. Of these subjects, 1047 had angiographi- 
cally documented severe coronary atherosclerosis (CAD 
group), the majority of them being candidates for coronary 
artery bypass grafting or percutaneous coronary intervention. 
The disease severity was evaluated by counting the 
number of major epicardial coronary arteries (left anterior 
descending, circumflex, and right) affected with ≥1 
significant stenosis (≥50%). On the other hand, 384 
subjects had completely normal coronary arteries, being 
submitted to coronary angiography for reasons other than 
CAD, mainly valvular heart disease (CAD-free group). 
Controls were also required to have neither history nor 
clinical or instrumental evidence of atherosclerosis in 
vascular districts beyond the coronary bed. Since the 
primary aim of our selection was to provide an objective 
and clear-cut definition of the atherosclerotic phenotype, 
subjects with nonsignificant coronary stenosis (ie <50%) 
were not included in the study. CAD subjects were 
classified into MI and non-MI subgroups by combining 
data from history with a thorough review of medical 
records showing diagnostic electrocardiogram and enzyme 
changes, and/or the typical sequelae of MI on ventricular 
angiography. Appropriate documentation was obtained 
for 1046/1047 (99.9%) CAD patients: of those, 624 
subjects had a history of previous MI, whereas the 
remaining 422 subjects had no history of MI. The 
angiograms were assessed by two cardiologists unaware 
that the patients were to be included in the study. Samples 
of venous blood were drawn from each subject after 
an overnight fast. Serum lipids and the other routine 
biochemical parameters were determined as described 
previously. 14 At the time of blood sampling, a complete 
clinical history was collected, including the assessment 
of cardiovascular risk factors such as obesity, smoking, 
hypertension, and diabetes. 

The study was approved by our local Ethical Committee. 
Informed consent was obtained from all the patients after a 
full explanation of the study. 

Genotyping 

To allow comparison with studies in other 
populations, we selected seven previously described 
ALOX5AP (GeneID: 241; chromosome: 13q12) SNPs 
(SG13S25, SG13S377, SG13S114, SG13S89, SG13S32, 
SG13S41 and SG13S35), maintaining their original 
nomenclature, 4 as well as the nomenclature of the 
reconstructed haplotypes. The seven SNPs were initially tested by 
PCR and restriction analyses (Supplementary Table 1) in a 
small group of randomly chosen DNA samples in order to 
verify the heterozygosity in the study population. All the 
samples were then genotyped in two multiplex reactions 
for six SNPs (SG13S377, SG13S41, SG13S32, and SG13S114 
in multiplex one, M1; SG13S25 and SG13S35 in multiplex two, 
M2) using LightCycler™ real-time PCR technology based 
on fluorescence resonance energy transfer and melting 
point analysis. The sequences of primers and probes used 
for genotyping the six SNPs by melting point analysis 
are shown in Supplementary Table 2. Both primers and 
fluorescently labelled probes were synthesized by 
Sigma-Proligo (Proligo France SAS). PCR and melting curve 
analysis were performed in 20 µl volumes in glass capillaries 
(Hoffmann-La Roche). PCR conditions for M1 and M2 are 
detailed in Supplementary Tables 3 and 4, respectively. 
Cycling and melting curve analysis conditions were 
different for the two multiplex reactions, as given in 
Supplementary Table 5. As the SG13S89 polymorphism 
was not easily detectable in a multiplex reaction, it was 
genotyped by PCR and restriction analysis for all the 
samples, using the following primers: forward (F) 
5'-AAGTGCATCTCAAGGAGGT-3' and reverse (R) 
5'-ATTAGCAGAAGAGCCAAGT-3'. 

Statistical analysis 

The analyses were performed mainly with the SPSS 13.0 
statistical package (SPSS Inc., Chicago, IL, USA). 
Distributions of continuous variables in groups were expressed as 
means ± SD. Quantitative data were assessed using 
Student's t-test. Associations between qualitative variables 
were analysed with the χ2 test or Fisher's exact test, when 
indicated. Allele and genotype frequencies among cases 
and controls were compared with values predicted by 
Hardy-Weinberg equilibrium using the χ2 test. To assess the 
association with CAD or MI, relative risks associated with 
each genotype were calculated separately by univariate 
logistic regression and then by multiple logistic regression 
adjusted for the traditional cardiovascular risk factors (ie 
sex, age, smoking, hypertension, diabetes, total cholesterol, 
and triglycerides), assuming an additive, dominant or 
recessive mode of inheritance. 
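
As an illustration of the Hardy-Weinberg comparison described above, the following minimal Python sketch tests observed genotype counts at one biallelic SNP against the counts expected under equilibrium. It is not the authors' SPSS procedure; the function name and the example counts are hypothetical, and the test uses a χ2 statistic with 1 degree of freedom.

```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit test (1 df) of genotype counts against
    Hardy-Weinberg proportions. Hypothetical helper, not from the paper.
    Assumes both alleles are present (otherwise an expected count is 0)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)      # frequency of allele A
    q = 1.0 - p                          # frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Illustrative counts only (not taken from the study data).
chi2, p_value = hwe_chi_square(329, 53, 2)
print(f"chi2 = {chi2:.3f}, P = {p_value:.3f}")
```

A large P-value here would indicate no detectable departure from Hardy-Weinberg proportions; with small expected counts an exact test is usually preferred.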

Pairwise linkage disequilibrium was examined as 
described by Devlin and Risch. 15 Haplotype frequencies 
were estimated using R software with the haplo.stats package 
(R Foundation for Statistical Computing, Vienna, Austria; 
ISBN 3-900051-07-0, URL: http://www.R-project.org). 16 
The upper and lower bounds of the 95% confidence 
interval (CI) were calculated by simulating 1000 random 
samples from a population having the haplotype 
frequencies estimated on the entire sample set. The D' measure was 
calculated for each simulated sample. The upper and lower 
bounds represent the quantiles corresponding to the 0.025 
or 0.975 probabilities of the D' distribution. Haplotype 
blocks were defined as proposed by Gabriel et al. 17 The 
relationship between haplotypes and clinical outcomes 
was examined using a generalized linear model regression 
of a trait on haplotype effects, allowing for ambiguous 
haplotypes (haplo.glm function), 16 and adjusting for the 
above-mentioned risk factors. A randomization test, 
permuting the cases and controls, was performed by means 
of the Monte Carlo method to confirm the results. Haplotypes 
present in fewer than 10 individuals were not considered in 
the analyses. 
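
To make the pairwise linkage disequilibrium measures concrete, here is a minimal Python sketch of Lewontin's normalized D' and the correlation coefficient r for two biallelic loci. It assumes the two-locus haplotype frequency is already known (in the study it was estimated with haplo.stats); the function name and the example frequencies are hypothetical.

```python
import math

def pairwise_ld(p_ab, p_a, p_b):
    """Return (D, D', r) for two biallelic loci.

    p_ab : frequency of the haplotype carrying allele A at locus 1
           and allele B at locus 2
    p_a  : frequency of allele A at locus 1
    p_b  : frequency of allele B at locus 2
    Assumes both loci are polymorphic (0 < p_a, p_b < 1).
    """
    d = p_ab - p_a * p_b
    # Lewontin's normalization: divide D by its maximum attainable value
    # given the allele frequencies and the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    r = d / math.sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r

# Illustrative frequencies only (not taken from the study data).
d, d_prime, r = pairwise_ld(p_ab=0.10, p_a=0.50, p_b=0.10)
print(f"D = {d:.3f}, D' = {d_prime:.3f}, r = {r:.3f}, r^2 = {r * r:.3f}")
```

The example shows why both measures are reported: D' can equal 1 (no evidence of historical recombination) while r stays well below 1 because the allele frequencies at the two loci differ.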

The study power was assessed by means of the Altman 
nomogram, after adjustment for the asymmetric 
distribution of population subgroups (CAD-free versus CAD; 
non-MI versus MI). The study has adequate power (>90%) to 
replicate the findings for odds ratios (ORs) greater than 2.0, 
which is consistent with those observed in the previous 
studies. 4 For each OR, 95% CIs were calculated. A 
two-tailed P-value <0.05 was considered significant. 
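
For orientation, an unadjusted OR and its 95% CI can be computed from a 2x2 table as in the minimal Python sketch below. This uses the simple log-OR (Woolf) interval, which is a swapped-in technique for illustration; the ORs reported in this study come from adjusted logistic regression, not from this formula. The function name and counts are hypothetical.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.959964):
    """Unadjusted odds ratio with a Woolf (log-based) confidence interval.

    a: carriers among cases       b: non-carriers among cases
    c: carriers among controls    d: non-carriers among controls
    z: standard normal quantile (1.959964 gives a 95% interval)
    All four cells must be non-zero.
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Illustrative counts only (not taken from the study data).
or_, lo, hi = odds_ratio_ci(20, 80, 10, 90)
print(f"OR = {or_:.2f}, 95% CI {lo:.2f}-{hi:.2f}")
```

An interval that excludes 1 corresponds to a two-tailed P-value below 0.05 for the unadjusted association.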



Results 

Table 1a summarizes the clinical characteristics of the 
study population stratified according to the presence 
(CAD) or absence (CAD-free) of angiographically 
documented CAD. As expected, traditional cardiovascular risk 
factors were more represented in the CAD group. The 
characteristics of the CAD population, divided into two 
subgroups according to the presence or absence of a 
previously documented MI, are reported in Table 1b. The 
genotype frequencies for the polymorphisms tested were 
in Hardy-Weinberg equilibrium in both the CAD and 
CAD-free groups. 

Allele and genotype distributions were similar either 
between CAD and CAD-free groups (Table 2a) or within 



Table 1a Clinical characteristics of the study population 
stratified according to absence (CAD-free) or presence 
(CAD) of angiographically documented CAD 

                              CAD-free        CAD 
                              (n = 384)       (n = 1047)      P-values 

Age (years)                   58.7 ± 12.3     61.2 ± 9.8      <0.001 
Males (%)                     65.6            79.8            <0.001 
BMI (kg/m2)                   25.4 ± 3.5      26.8 ± 3.5      0.936 
Hypertension (%)              40.5            66.3            <0.001 
Smoking (%)                   43.9            67.9            <0.001 
Diabetes (%)                  6.8             19.2            <0.001 
Total cholesterol (mmol/l)    5.47 ± 1.10     5.54 ± 1.17     0.322 
Triglycerides (mmol/l)        1.49 ± 0.67     1.91 ± 0.99     <0.001 

P-values by t-test (continuous variables) or by χ2 test (percentages). 

Table 1b Clinical characteristics of the CAD patients, 
with (MI) or without (no-MI) a previous documented MI 

                              No-MI           MI 
                              (n = 422)       (n = 624)       P-values 

Age (years)                   62.5 ± 8.9      60.4 ± 10.2     0.002 
Males (%)                     75.6            82.7            0.005 
BMI (kg/m2)                   27.0 ± 3.5      26.6 ± 3.6      0.583 
Hypertension (%)              72.3            62.1            0.001 
Smoking (%)                   62.3            71.7            0.002 
Diabetes (%)                  19.3            19.2            0.592 
Total cholesterol (mmol/l)    5.6 ± 1.1       5.5 ± 1.2       0.677 
Triglycerides (mmol/l)        1.9 ± 0.9       1.9 ± 1.0       0.396 
CAD severity (%)                                              <0.001 
  One vessel                  24.3            12.4 
  Two vessels                 26.0            24.1 
  Three vessels               48.1            61.8 
  Left main coronary artery   1.7             1.7 

P-values by t-test (continuous variables) or by χ2 test (percentages). 



CAD subjects with or without a previous MI (Table 2b). 
Results from the regression analyses, assuming additive, 
dominant or recessive mode of inheritance, showed no 
significant association of the gene variants tested with the 
clinical outcomes (data not shown). In general, the SNPs 
tested were in linkage disequilibrium, as shown in Table 3. 

Considering haplotype analysis, the most frequent 
haplotypes were G-T-G-C and G-T-A-G for HapA SNPs and 
HapB/C SNPs, respectively, and thus were used as the 
referents. The haplotype distributions for HapA SNPs were 
similar between CAD and CAD-free subjects (P = 0.937). On 
the other hand, the haplotype distributions for HapB SNPs 
were significantly different between CAD and CAD-free 
subjects (P = 0.014), as shown in Table 4a. More precisely, 
two haplotypes A-A-A-G (HapB) and G-T-A-A (that we 
named HapC) were more represented in CAD group (7.5 
versus 5.5% and 3.7 versus 1.6%, respectively), and these 
associations remained significant also after adjustment for 

Table 2a ALOX5AP genotype and allele distribution in the 
study population stratified according to absence 
(CAD-free) or presence (CAD) of angiographically documented 
CAD 

ALOX5AP genotype, %    CAD-free (n = 384)    CAD (n = 1047)    P-values* 

SG13S25 
  GG                   85.7                  84.6              0.885 
  GA                   13.8                  14.8 
  AA                   0.5                   0.6 
  G allele             92.6                  92.0              0.682 
  A allele             7.4                   8.0 

SG13S377 
  GG                   76.8                  73.9              0.180 
  GA                   20.8                  24.6 
  AA                   2.3                   1.4 
  G allele             87.2                  86.2              0.530 
  A allele             12.8                  13.8 

SG13S114 
  TT                   42.4                  40.6              0.559 
  TA                   41.7                  44.8 
  AA                   15.9                  14.6 
  T allele             63.3                  63.0              0.921 
  A allele             36.7                  37.0 

SG13S89 
  GG                   86.7                  87.5              0.794 
  GA                   12.8                  12.1 
  AA                   0.5                   0.4 
  G allele             93.1                  93.6              0.727 
  A allele             6.9                   6.4 

SG13S32 
  AA                   22.9                  22.6              0.747 
  AC                   51.6                  49.9 
  CC                   25.5                  27.5 
  C allele             51.3                  52.4              0.620 
  A allele             48.7                  47.6 

SG13S41 
  AA                   81.8                  81.9              0.997 
  AG                   17.4                  17.3 
  GG                   0.8                   0.8 
  A allele             90.5                  90.6              0.995 
  G allele             9.5                   9.4 

SG13S35 
  GG                   83.9                  81.5              0.528 
  GA                   15.6                  18.1 
  AA                   0.5                   0.4 
  G allele             91.7                  90.5              0.396 
  A allele             8.3                   9.5 

*By χ2 test or Fisher's exact test. 



traditional cardiovascular risk factors, that is, sex, age, 
smoking, hypertension, diabetes, total cholesterol, and 
triglycerides (Table 4b). The significance of the general 
model, including genetic factors arranged as haplotypes, 
was confirmed after randomization test (P = 0.022 for 
general model, P = 0.013 for HapB and P = 0.021 for HapC, 
after 1000 permutations). 
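
The idea behind such a randomization test can be sketched in a few lines of Python. The sketch below is a simplified stand-in, not the authors' procedure: it permutes case/control labels and uses the absolute difference in carrier proportions as the test statistic, whereas the study permuted labels around a haplotype-based regression model. The function name, the seed, and the data are hypothetical.

```python
import random

def permutation_p_value(case_flags, control_flags, n_perm=1000, seed=0):
    """Monte Carlo permutation test for a difference in carrier proportion.

    case_flags / control_flags: sequences of 0/1 flags, where 1 marks a
    carrier of the haplotype of interest. Labels are shuffled n_perm times.
    """
    rng = random.Random(seed)            # seeded for reproducibility
    n_cases = len(case_flags)
    pooled = list(case_flags) + list(control_flags)

    def abs_diff(values):
        cases, controls = values[:n_cases], values[n_cases:]
        return abs(sum(cases) / len(cases) - sum(controls) / len(controls))

    observed = abs_diff(pooled)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)              # reassign case/control labels
        if abs_diff(pooled) >= observed:
            hits += 1
    # Add-one correction so the estimated P-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Illustrative data only (not taken from the study).
cases = [1] * 15 + [0] * 85
controls = [1] * 5 + [0] * 95
print(permutation_p_value(cases, controls))
```

With 1000 permutations the smallest attainable P-value is about 0.001, which sets the resolution of the estimates.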



Table 2b ALOX5AP genotype and allele distribution in 
the CAD group stratified according to absence (no-MI) or 
presence (MI) of previously documented MI 

ALOX5AP genotype, %    No-MI (n = 422)    MI (n = 624)    P-values* 

SG13S25 
  GG                   84.6               84.6            0.107 
  GA                   14.2               15.2 
  AA                   1.2                0.2 
  G allele             91.7               92.2            0.727 
  A allele             8.3                7.8 

SG13S377 
  GG                   72.0               75.2            0.509 
  GA                   26.3               23.6 
  AA                   1.7                1.3 
  G allele             85.2               87.0            0.283 
  A allele             14.8               13.0 

SG13S114 
  TT                   37.7               42.6            0.263 
  TA                   47.4               42.9 
  AA                   14.9               14.4 
  T allele             61.4               64.1            0.222 
  A allele             38.6               35.9 

SG13S89 
  GG                   87.4               87.7            0.868 
  GA                   12.3               11.9 
  AA                   0.2                0.5 
  G allele             93.6               93.6            0.936 
  A allele             6.4                6.4 

SG13S32 
  AA                   23.9               21.8            0.719 
  AC                   49.1               50.3 
  CC                   27.0               27.9 
  C allele             51.5               53.0            0.528 
  A allele             48.5               47.0 

SG13S41 
  AA                   83.4               81.1            0.515 
  AG                   16.1               17.9 
  GG                   0.5                1.0 
  A allele             91.5               90.1            0.315 
  G allele             8.5                9.9 

SG13S35 
  GG                   81.3               81.7            0.282 
  GA                   18.7               17.6 
  AA                   0                  0.6 
  G allele             90.6               90.5            0.997 
  A allele             9.4                9.5 

*By χ2 test or Fisher's exact test. 



There was no difference in haplotype distributions 
between CAD subjects with or without a previous MI, 
either for HapA region or for HapB region (Table 4c). 



Discussion 

The present investigation in Italian patients provides some 
evidence that the ALOX5AP gene might play a role in 

Table 3 Pairwise linkage disequilibrium between the SNPs tested 

            SG13S25  SG13S377  SG13S114  SG13S89  SG13S32  SG13S41  SG13S35 

SG13S25     -        0.013     0.045     0.006    0.087    0.007    0.002 
SG13S377    1.000    -         0.185     0.003    0.113    0.007    0.145 
SG13S114    0.956    0.834     -         0.072    0.038    0.105    0.026 
SG13S89     1.000    0.544     0.774     -        0.050    0.463    0.004 
SG13S32     0.969    0.815     0.245     0.882    -        0.093    0.077 
SG13S41     0.920    0.666     0.770     0.828    0.987    -        0.009 
SG13S35     0.529    0.473     0.392     0.790    0.836    0.914    - 

D', Lewontin normalized value (below the diagonal); R, correlation coefficient (above the diagonal). 



conferring susceptibility to CAD also in South European 
populations. Nonetheless, since the statistically significant 
association we found was relatively weak, the role of 
this gene, if any, seems modest. To put our results into a 
more general perspective, we propose the following 
considerations. 

Comparison with previous studies 
The landmark study by Helgadottir et al identified two 
different haplotypes as CAD risk factors in populations of 
different ancestry. According to the haplotype block 
definition proposed by Gabriel et al, 17 we observed three 
haplotype blocks of two SNPs each (block 1: SG13S25-SG13S377; 
block 2: SG13S32-SG13S41; block 3: SG13S41-SG13S35). 
Therefore, the SNPs describing HapA or HapB/C 
do not define a single haplotype block. This finding is 
consistent with what was observed by Helgadottir et al. 4 

In Icelanders (a genetic isolate), HapA conferred a 
nearly twofold risk of MI. 4 This was not confirmed in 
British patients, in whom, on the other hand, a different 
4-SNP haplotype (HapB) was associated with MI. Neither 
HapA nor HapB was associated with incident MI in a cohort 
of male US physicians. 13 To allow a comparison, 
we focused on a standardized set of seven ALOX5AP SNPs, 
allowing reconstruction of the same at-risk haplotypes 
reported in the literature. With respect to HapA, our results 
suggest that this haplotype may not be informative for 
risk assessment of CAD in non-Icelandic populations. With 
respect to HapB, a modest contribution of this genetic 
marker to CAD risk was observed. Haplotype analyses 
revealed in our population a nominally significant 
association between CAD and another ALOX5AP haplotype 
('HapC'), unremarkable in previous studies. Considering 
also the low frequency of this haplotype, the relevance of 
this finding remains uncertain. The observed differences 
among populations are not surprising, and may relate in 
part to population-specific differences in allele and 
haplotype frequencies (for a summary of previous studies see 
Table 5). For example, the frequency of HapA in Icelandic 
controls (9.5%) was well below that observed in North 
American (15%), German (15.2%), and our Italian (18.6%) 
populations. Moreover, it has to be underscored that we are 
dealing with disease-risk-associated haplotypes made of 
SNPs with no obvious potential effects on function, whose 
association(s) with yet unidentified causal variant(s) in 
ALOX5AP may differ between populations with differing 
genealogies. In other words, it would not be unexpected to 
find in the future different pathogenic ALOX5AP 
mutation(s), with different frequencies among populations, 
arising on different haplotype backgrounds. Notably, 
a replication study in a Japanese population 18 found 
allele frequencies of the HapA/HapB SNPs too low to conduct 
a meaningful association analysis. Nevertheless, in that population 
haplotypes constructed on the basis of two other intronic 
SNPs were significantly associated with MI. 

ALOX5AP, leukotriene pathway, and CAD 
pathogenesis 

Preliminary functional data by Helgadottir et al indicated 
that some at-risk haplotypes were associated with increased 
neutrophil release of leukotriene B4 (LTB4). Since LTB4 is 
synthesized from LTA4, this implies that ALOX5AP variants 
might determine a proinflammatory gain of function. The 
role of inflammation in CAD pathogenesis is now well 
established (reviewed by Hansson 19 ). The FLAP protein 
encoded by ALOX5AP has an important role in the initial 
steps of the biosynthesis of leukotrienes, 5,6 which in turn 
have a variety of proinflammatory effects. 20 Besides the 
ALOX5AP story, genetic evidence for the involvement of 
the 5-LO/leukotriene pathway in CAD is accumulating. 21-23 
The same Icelandic group recently reported that another 
gene involved in this pathway, that is, leukotriene A4 
hydrolase, conferred risk of CAD, especially in African 
Americans. 21 Dwyer et al 22 found an association between 
promoter variations of the ALOX5 gene (encoding 5-LO, 
ie the FLAP target) and carotid intima-media thickness 
(a preclinical surrogate marker of atherosclerosis). As a 
functional counterpart of these intriguing genetic studies, a large 
body of animal experiments has linked the 5-LO pathway to 
atherosclerosis, although results are sometimes discordant 
(critically reviewed by Funk 23 ). Interestingly, much of the basic 
research leading to the 'lipoxygenases hypothesis' 24,25 
points towards an involvement in early events of atheroma 
development, through LTB4-mediated migration and 

Table 4 ALOX5AP haplotype distribution in the study 
population stratified according to the presence or absence 
of angiographically documented CAD (a); ORs with 95% 
CIs for CAD for HapB region haplotypes, calculated by 
means of haplotype-based logistic regression analysis 
adjusted for traditional risk factors for CAD, that is, sex, 
age, smoking, hypertension, diabetes, and plasma lipids 
(b); ALOX5AP haplotype distribution in the CAD group 
stratified according to absence (no-MI) or presence (MI) of 
previously documented myocardial infarction (c) 

(a) 
ALOX5AP haplotype        CAD-free (%)    CAD (%)    P-values* 

HapA SNPs (SG13S25, SG13S114, SG13S89, SG13S32) 
  G-T-G-A (HapA) a       18.6            16.9       0.937 
  G-T-G-C                35.8            37.5 
  G-A-G-A                22.7            22.3 
  G-A-G-C                8.6             8.8 
  G-A-A-C                5.5             5.5 
  A-T-G-A                7.4             7.8 

HapB/C SNPs (SG13S377, SG13S114, SG13S41, SG13S35) 
  G-T-A-G                58.3            56.8       0.014 
  G-T-A-A (HapC) a       1.6             3.7 
  G-T-G-G                1.4             1.1 
  G-A-A-G                17.1            16.0 
  G-A-G-G                7.3             7.7 
  A-T-A-G                1.5             1.0 
  A-A-A-G (HapB) a       5.5             7.5 
  A-A-A-A                4.8             4.6 

(b) 
HapB/C SNPs              OR for CAD          P-values†    P-values‡ 

  G-T-A-A (HapC) a       2.41 (1.09-5.32)    0.030        0.021 
  A-A-A-G (HapB) a       1.67 (1.04-2.67)    0.032        0.013 

(c) 
ALOX5AP haplotype        No-MI (%)       MI (%)     P-values* 

HapA SNPs (SG13S25, SG13S114, SG13S89, SG13S32) 
  G-T-G-A (HapA) a       15.7            17.6       0.587 
  G-T-G-C                36.6            38.2 
  G-A-G-A                24.1            21.2 
  G-A-G-C                8.9             8.8 
  G-A-A-C                5.7             5.4 
  A-T-G-A                8.3             7.5 

HapB/C SNPs (SG13S377, SG13S114, SG13S41, SG13S35) 
  G-T-A-G                55.6            57.5       0.547 
  G-T-A-A (HapC) a       3.5             3.9 
  G-T-G-G                0.9             1.3 
  G-A-A-G                17.2            15.2 
  G-A-G-G                6.9             8.3 
  A-T-A-G                1.0             1.1 
  A-A-A-G (HapB) a       8.4             7.0 
  A-A-A-A                4.5             4.7 

*By regression analysis. 
†By regression analysis adjusted for sex, age, smoking, hypertension, 
diabetes, total cholesterol, and triglycerides. 
‡By randomization test after 1000 permutations. 
a HapA is defined by SG13S25, SG13S114, SG13S89, and SG13S32 
SNPs, with alleles G, T, G, A, respectively. HapB and C are defined by 
SG13S377, SG13S114, SG13S41, and SG13S35 SNPs, with alleles A, 
A, A, G, or G, T, A, A, respectively. 
Bold characters underscore the haplotypes with a significantly different 
distribution. 



Table 5 Frequencies of HapA, HapB, and some ALOX5AP 
SNPs in studies published so far (in patients with MI or 
stroke) and in our study 

Study (author's name)      Controls (%)    Cases (%)        P-values 

Helgadottir et al, 4 
Icelandic cohort 
  HapA                     9.5             15.8 a, 14.9 b   <0.001 a, <0.001 b 
  SG13S114 allele T        [illegible]     [illegible] a    [illegible] 
British cohort 
  HapA                     16.8            15.1 a           NS 
  HapB                     4.0             7.5 a            <0.001 

Lohmussaar et al, 11 
German cohort 
  HapA                     15.2            14.5 b           NS 
  SG13S25 allele G         90.1            89.4 b           NS 
  SG13S114 allele T        65.0            68.5 b           0.025 
  SG13S89 allele G         96.0            94.7 b           NS 
  SG13S32 allele A         46.7            46.9 b           NS 

Helgadottir et al 
Scottish cohort 
  HapA                     14.[illegible]  18.4 b           0.007 
  HapB                     5.8             6.8 b            NS 

Kajimoto et al, 18 
Japanese cohort 
  SG13S25 allele G         99.97           100 a            0.557 
  SG13S377 allele G        81.6            80.0 a           0.243 
  SG13S114 allele T        64.7            64.1 a           0.298 
  SG13S89 allele G         99.2            99.0 a           0.603 
  SG13S32 allele A         64.9            65.1 a           0.428 
  SG13S41 allele A         99.2            98.7 a           0.303 
  SG13S35 allele G         100             100 a 
  A162C allele C           48.8            44.7 a           0.129 
  T8733A allele A          43.6            42.6 a           0.570 
  Haplotype 162A-8733A     20.0            25.8 a           0.003 
  Haplotype 162C-8733A     23.6            16.9 a           0.001 

Meschia et al, 12 
North American cohort 
  SG13S25 allele G         87.9            89.7 b           0.200 
  SG13S114 allele T        57.8            59.1 b           0.180 
  SG13S89 allele G         87.4            91.2 b           0.150 
  SG13S32 allele A         49.2            51.3 b           0.790 

Zee et al, 13 
US cohort 
  HapA                     14 c, 15 d      17 a, 18 b       0.460 a, 0.710 b 
  HapB                     7 c, 7 d        6 a, 8 b         0.080 a, 0.470 b 
  SG13S25 allele G         90 c, [?]1 d    90 a, 90 b       0.890 a, 0.470 b 
  SG13S377 allele G        87 c, 83 d      88 a, 87 b       0.410 a, 0.150 b 
  SG13S114 allele T        68 c, 63 d      68 a, 63 b       0.630 a, 0.990 b 
  SG13S89 allele G         95 c, 94 d      94 a, 94 b       0.840 a, 0.960 b 
  SG13S32 allele A         46 c, 52 d      50 a, 52 b       0.150 a, 0.990 b 
  SG13S41 allele A         91 c, 92 d      91 a, 92 b       0.730 a, 0.680 b 
  SG13S35 allele G         91 c, 89 d      93 a, 91 b       0.210 a, 0.260 b 

This study, 2006 
Italian cohort 
  HapA                     18.6            16.9 e           0.937 
  HapB                     5.5             7.5 e            0.014 
  SG13S25 allele G         92.6            92.0 e           0.682 
  SG13S377 allele G        87.2            86.2 e           0.530 
  SG13S114 allele T        63.3            63.0 e           0.921 
  SG13S89 allele G         93.1            93.6 e           0.727 
  SG13S32 allele A         48.7            47.6 e           0.620 
  SG13S41 allele A         90.5            90.6 e           0.995 
  SG13S35 allele G         91.7            90.5 e           0.396 

NS, nonsignificant. 
a Myocardial infarction. 
b Stroke. 
c Control group for myocardial infarction. 
d Control group for stroke. 
e Coronary artery disease. 
Italic numbers indicate the characteristics of case and control groups. 



activation of monocyte/macrophages, as well as lipoxy- 
genases-mediated LDL oxidation. 

Peculiarities of the present study: strengths and 
limitations 

Previous studies on ALOX5AP focused on MI patients versus 
controls selected from the general population or from 
event-free subjects such as in the prospective Physicians' 
Health Study cohort. 4,13 MI is usually a late thrombotic 
complication superimposed on coronary atherosclerotic 
plaque rupture, 26 so the design of previous studies did 
not directly allow a putative specific role of ALOX5AP 
in MI to be separated from a role in CAD development. Our 
alternative experimental design focused on subjects with 
angiographically proven CAD, with or without a previously 
documented MI. Moreover, the angiography-based design 
enabled us to select CAD-free subjects with an objectively 
defined control status, a critical issue in genetic association 
studies. 27 This allowed us to overcome the caveat, common 
in Western general populations where atherosclerosis is 
endemic, of enrolling controls with substantial coronary 
atherosclerotic lesions that are not yet clinically evident. 
While our CAD-free subjects cannot be considered a 
'typical' control group, we feel confident that they 
acceptably represent the background general 
population, since their genotype and haplotype distributions 
are not fundamentally different from those observed 
in controls from German and US populations (see above). 
Since we noted haplotype differences only between the 
whole CAD group and the CAD-free group, and not 
between CAD patients with or without MI, our data appear 
to be consistent with a more relevant role of ALOX5AP in 
atherogenesis rather than in thrombogenesis, in accordance with 
many of the above-mentioned biochemical data. 

This study suffers from the common limitations of genetic
association studies of complex traits. 28 Despite the
imbalance between cases and controls, it was sufficiently
powered to detect a predefined effect of ALOX5AP
haplotypes on CAD (see above). On the other hand, we
could not properly analyse some interesting issues, such as
a possible stronger effect of ALOX5AP in males than in
females, 4 because of the limited number of women
enrolled.

Finally, from a practical perspective, the relatively low
frequency of 'at-risk' haplotypes in our population, as well
as their modest effect on CAD risk, has to be taken into
account.



Conclusions 

ALOX5AP represents the paradigm of a new class of
promising genes identified by powerful genome-wide
investigations, which are currently the object of intense
investigation to confirm their role in CAD susceptibility.
Our data neither refute nor strongly support this hypothesis.
Adding them to current knowledge, some evidence
for ALOX5AP as a genetic susceptibility factor for CAD has
now emerged in four out of five independent populations
(Icelandic, British, Japanese, and Italian; but not North
American). Our angiography-based study suggests a possible
role of ALOX5AP/FLAP in the development of the atheroma
rather than in its late thrombotic complications such as
MI. Such a role, if any, appears to be modest. Much further
work is needed to understand the reason(s) for the
heterogeneous results, as well as to identify possible ALOX5AP
pathogenic variations.



Acknowledgements 

This work was supported by grants from the Veneto Region, Italian 
Ministry of University and Research (Grant no. 200S/06S1S2), and 
the Cariverona Foundation, Verona, Italy. We wish to thank Mrs 
Maria Zoppi for invaluable secretarial help, and Mr Diego Minguzzi for
technical assistance. The authors declare that they have no potential
conflict of interest.



References 

1 Watkins H, Farrall M: Genetic susceptibility to coronary artery 
disease: from promise to progress. Nat Rev Genet 2006; 7: 
163-173. 

2 Lusis AJ, Fogelman AM, Fonarow GC: Genetic basis of athero- 
sclerosis, Part I, New genes and pathways. Circulation 2004; 110: 
1868-1873. 

3 Wang Q: Molecular genetics of coronary artery disease. Curr Opin 
Cardiol 2005; 20: 182-188. 

4 Helgadottir A, Manolescu A, Thorleifsson G et al: The gene 
encoding 5-lipoxygenase activating protein confers risk of 
myocardial infarction and stroke. Nat Genet 2004; 36: 233-239. 

5 Dixon RAF, Diehl RE, Opas E et al: Requirement of a 5- 
lipoxygenase-activating protein for leukotriene synthesis. Nature 
1990; 343: 282-284. 

6 Miller DK, Gillard JW, Vickers PJ et al: Identification and isolation
of a membrane protein necessary for leukotriene production. 
Nature 1990; 343: 278-281. 



European Journal of Human Genetics 



ALOX5AP gene variants and coronary artery disease

D Girelli et al



966 

7 Spanbroek R, Grabner R, Lotzer K et al: Expanding expression of
5-lipoxygenase pathway within the arterial wall during human
atherogenesis. Proc Natl Acad Sci USA 2003; 100: 1238-1243.

8 Qiu H, Gabrielsen A, Agardh HE et al: Expression of
5-lipoxygenase and leukotriene A4 hydrolase in human
atherosclerotic lesions correlates with symptoms of plaque instability.
Proc Natl Acad Sci USA 2006; 103: 8161-8166.

9 Mehrabian M, Allayee H, Wong J et al: Identification of
5-lipoxygenase as a major gene contributing to atherosclerosis
susceptibility in mice. Circ Res 2002; 91: 120-126.

10 Zhao L, Funk CD: Lipoxygenase pathways in atherogenesis.
Trends Cardiovasc Med 2004; 14: 191-195.

11 Lõhmussaar E, Gschwendtner A, Mueller JC et al: ALOX5AP gene
and the PDE4D gene in a central European population of stroke
patients. Stroke 2005; 36: 731-736.

12 Meschia JF, Brott TG, Brown RD et al: Phosphodiesterase 4D and
5-lipoxygenase activating protein in ischemic stroke. Ann Neurol
2005; 58: 351-361.

13 Zee RY, Cheng S, Hegener HH, Erlich HA, Ridker PM: Genetic
variants of arachidonate 5-lipoxygenase activating protein and
risk of incident myocardial infarction and ischemic stroke. Stroke
2006; 37: 2007-2011.

14 Girelli D, Russo C, Ferraresi P et al: Polymorphisms in the factor
VII gene and the risk of myocardial infarction in patients with
coronary artery disease. N Engl J Med 2000; 343: 774-780.

15 Devlin B, Risch N: A comparison of linkage disequilibrium
measures for fine-scale mapping. Genomics 1995; 29: 311-322.

16 Lake SL, Lyon H, Tantisira K et al: Estimation and tests of
haplotype-environment interaction when linkage phase is
ambiguous. Hum Hered 2003; 55: 56-65.

17 Gabriel SB, Schaffner SF, Nguyen H et al: The structure of
haplotype blocks in the human genome. Science 2002; 296:
2225-2229.

18 Kajimoto K, Shioji K, Ishida C et al: Validation of the association
between the gene encoding 5-lipoxygenase-activating protein
and myocardial infarction in a Japanese population. Circ J 2005;
69: 1029-1034.

19 Hansson GK: Inflammation, atherosclerosis, and coronary artery
disease. N Engl J Med 2005; 352: 1685-1695.

20 Samuelsson B: Leukotrienes: mediators of immediate
hypersensitivity reactions and inflammation. Science 1983; 220: 568-575.

21 Helgadottir A, Manolescu A, Helgason A et al: A variant of the
gene encoding leukotriene A4 hydrolase confers ethnicity-specific
risk of myocardial infarction. Nat Genet 2006; 38: 68-74.

22 Dwyer JH, Allayee H, Dwyer KM et al: Arachidonate
5-lipoxygenase promoter genotype, dietary arachidonic acid and
atherosclerosis. N Engl J Med 2004; 350: 29-37.

23 Funk CD: Leukotriene modifiers as potential therapeutics for
cardiovascular disease. Nat Rev Drug Discov 2005; 4: 664-672.

24 Steinberg D: At last, direct evidence that lipoxygenases play a role
in atherogenesis. J Clin Invest 1999; 103: 1487-1488.

25 Lotzer K, Funk CD, Habenicht AJR: The 5-lipoxygenase pathway
in arterial wall biology and atherosclerosis. Biochim Biophys Acta
2005; 1736: 30-37.

26 Lusis AJ: Atherosclerosis. Nature 2000; 407: 233-241.

27 Lander ES, Schork NJ: Genetic dissection of complex traits. Science
1994; 265: 2037-2048.

28 Colhoun HM, McKeigue PM, Davey Smith G: Problems of
reporting genetic associations with complex outcomes. Lancet
2003; 361: 865-872.

29 Helgadottir A, Gretarsdottir S, St Clair D et al: Association
between the gene encoding 5-lipoxygenase-activating protein
and stroke replicated in a Scottish population. Am J Hum Genet
2005; 76: 505-509.

Supplementary Information accompanies the paper on European
Journal of Human Genetics website (http://www.nature.com/ejhg)






Genetic Variants of Arachidonate 5-Lipoxygenase- 
Activating Protein, and Risk of Incident Myocardial 
Infarction and Ischemic Stroke 

A Nested Case-Control Approach 

Robert Y.L. Zee, PhD; Suzanne Cheng, PhD; Hillary H. Hegener, BS; Henry A. Erlich, PhD; Paul M. Ridker, MD

Background and Purpose: Recent findings have implicated specific gene polymorphisms of arachidonate
5-lipoxygenase-activating protein (ALOX5AP), and 2 at-risk haplotypes (HapA, HapB), in myocardial infarction and stroke.
To date, no prospective data are available.

Methods: We evaluated 10 specific Icelandic ALOX5AP gene variants among 600 male participants with incident
atherothrombotic events (myocardial infarction [MI] or ischemic stroke) and among 600 age- and smoking-matched
male participants, all white, who remained free of reported cardiovascular disease during follow-up within the
Physicians' Health Study cohort.

Results: Overall allele, genotype, and haplotype distributions were similar between cases and controls. Single-marker
conditional logistic regression analysis adjusted for potential risk factors found no association with risk of atherothrombotic
events. Further investigation using a haplotype-based approach showed similar null findings with MI (HapA: odds ratio
[OR]=1.18, 95% CI, 0.76 to 1.85; P=0.46; HapB: odds ratio=0.62, 95% CI, 0.36 to 1.07; P=0.08), and with ischemic stroke
(HapA: odds ratio=1.11, 95% CI, 0.65 to 1.89; P=0.71; HapB: odds ratio=0.82, 95% CI, 0.47 to 1.42; P=0.47).

Conclusions: We found no evidence for an association of the specific Icelandic ALOX5AP gene variants/at-risk haplotypes
tested with risk of incident MI or ischemic stroke in this prospective, non-Icelandic study. (Stroke. 2006;37:2007-2011.)

Key Words: ALOX5AP ■ haplotypes ■ MI ■ risk factors ■ stroke 



Cardiovascular diseases, including myocardial infarc- 
tion (MI) and ischemic stroke, are the leading causes 
of mortality and morbidity in western countries. The 
underlying pathogenesis is likely to be mediated by both 
genetic and environmental risk factors. The initial report, 1 
in an Icelandic population, of a significant association of 
genetic variants of arachidonate 5-lipoxygenase-activating 
protein (ALOX5AP) with increased risk of MI and stroke 
has attracted great interest. In their study, Helgadottir and 
coauthors reported a linkage and association of a 4-single- 
nucleotide polymorphism (SNP) haplotype, HapA, of 
ALOX5AP gene with risk of MI and stroke. 1 In addition, 
they reported an association of a different 4-SNP haplo- 
type, HapB, with risk of MI in a British population. 1 
Helgadottir and coauthors further assessed the contribution 
of ALOX5AP variants, in particular the HapA, and HapB 
haplotypes, to stroke, in a Scottish population, and found 
that the HapA haplotype confers a relative risk of 1.36 
assuming a multiplicative model (P=0.007) for stroke. 2 
However, they found no association for HapB. Subsequent
studies by others in several non-Icelandic populations have
since yielded conflicting results. 3,4

To date, no prospective genetic-epidemiological data are 
available on risk of MI, and ischemic stroke. We therefore 
simultaneously evaluated the role of 10 ALOX5AP (GeneID:
241; Chromosome: 13q12) SNPs (SG13S25, SG13S377,
SG13S106, SG13S114, SG13S89, SG13S30, SG13S32, 
SG13S41, SG13S42, and SG13S35), and specific haplotypes 
thereof, in particular HapA, and HapB at-risk haplotypes, as 
risk determinants of incident MI, and ischemic stroke in a 
prospective, nested case-control sample within the Physi- 
cians' Health Study (PHS) cohort. These polymorphisms 
(except SG13S106, SG13S30, and SG13S42: unpublished 
data from deCODE Genetics) were chosen based on the 
associations observed in the Icelandic study. 1 

Materials and Methods 

Study Design 

We used a nested case-control design within the PHS, 5 a randomized,
double-blinded, placebo-controlled trial of aspirin and beta
carotene initiated in 1982 among 22 071 male, predominantly white
(>94%) US physicians, 40 to 84 years of age at study entry. Before
randomization, 14 916 participants provided an EDTA-anticoagulated
blood sample, which was stored for genetic analysis. All participants
were free of prior MI, stroke, transient ischemic attacks, and cancer
at study entry. As the study participants were all US male physicians,
yearly follow-up self-report questionnaires provide reliable updated
information on newly developed diseases and the presence or absence of
other cardiovascular risk factors. History of cardiovascular risk
factors, such as hypertension (>140/90 mm Hg or on antihypertensive
medication), diabetes, or hyperlipidemia (>240 mg/dL), was
defined by self-report of diagnosis at entry into the study. For all
reported incident vascular events occurring after study enrollment,
hospital records, death certificates, and autopsy reports were
requested and reviewed by an end-points committee using standardized
diagnostic criteria.

The diagnosis of MI was confirmed by evidence of symptoms in
the presence of either diagnostic elevations of cardiac enzymes or
diagnostic changes on electrocardiograms. In the case of fatal events,
the diagnosis of MI was also accepted based on autopsy findings.
Stroke was defined by the presence of a new focal neurological 
deficit, with symptoms and signs persisting for >24 hours, and was 
ascertained from blinded review of medical records, autopsy results 
and the judgment of a board-certified neurologist, on the basis of 
clinical reports, computed tomographic, or MRI scanning. 

For each case (MI or ischemic stroke), a control matched by age,
smoking history (never, past, or current), and length of follow-up
was chosen among those subjects who remained free of vascular
diseases. The present association study consisted of 341 MI
case-control pairs, and 259 ischemic stroke case-control pairs, all white
males. 

The study was approved by the Brigham and Women's Hospital 
Institutional Review Board for Human Subjects Research. 

Genotyping Determination 

Genotyping was performed using an immobilized probe approach, as
previously described (Roche Molecular Systems). 6 In brief, each
DNA sample was amplified in a multiplex polymerase chain reaction
using biotinylated primers. Each polymerase chain reaction product
pool was then hybridized to a panel of sequence-specific
oligonucleotide probes immobilized in a linear array. The colorimetric
detection method was based on the use of streptavidin-horseradish
peroxidase conjugate with hydrogen peroxide and
3,3',5,5'-tetramethylbenzidine as substrates.

To confirm genotype assignment, scoring was carried out by 2 
independent observers. Discordant results (<1% of all scoring) were 
resolved by a joint reading, and where necessary, a repeat genotyp- 
ing. Results were scored blinded as to case-control status. Overall 
completion rate of genotyping determination was >95%. 

Statistical Analysis 

Allele and genotype frequencies among cases and controls were
compared with values predicted by Hardy-Weinberg equilibrium
using the χ² test. Relative risks associated with each genotype were
calculated separately by conditional logistic regression analysis,
conditioning on the matching by age, smoking status, and length of
follow-up since randomization, and further controlling for
randomized treatment assignment, history of hypertension, presence or
absence of diabetes, and body mass index, assuming an additive,
dominant, or recessive mode of inheritance. Pairwise linkage
disequilibrium (LD) was examined as described by Devlin and Risch. 7
For comparison with published reports by others, we examined 2
previously described at-risk haplotypes: HapA
(SG13S25G-SG13S114T-SG13S89G-SG13S32A), and HapB
(SG13S377A-SG13S114A-SG13S41A-SG13S35G). Haplotype estimation and
inference was determined using PHASE v2.1. 8,9 Haplotype
distributions between cases and controls were examined by
likelihood ratio test. The relationship between haplotypes and clinical
outcomes was examined using a haplotype-based logistic regression
analysis with baseline-parameterization, 10 adjusting for the same risk
factors. All analyses were carried out using the SAS/Genetics 9.1
package (SAS Institute, Inc). For each odds ratio (OR), we calculated
95% CIs. A 2-tailed P value of 0.05 was considered a statistically
significant result.

TABLE 1. Baseline Characteristics of Study Participants Who
Subsequently Developed Any Arterial Event (Cases), and Those
Who Remained Free of Vascular Disease During Follow-Up
(Controls)

                              Controls     Cases
                              (n=600)      (n=600)      P
Age, y                        60.8±0.3     61.0±0.3     m.v.
Smoking status, %                                       m.v.
  Never                       41.7         41.7
  Past                        41.5         41.5
  Current                     16.8         16.8
Body mass index, kg/m²        24.9±0.1     25.4±0.1     0.001
Blood pressure, mm Hg
  Systolic                    128.6±0.5    132.7±0.6    <0.0001
  Diastolic                   79.6±0.3     81.8±0.3     <0.0001
Hyperlipidemia, %             14.9         22.8         <0.001
Hypertension, %               29.0         47.2         <0.0001
Diabetes, %                   2.8          8.9          <0.0001
Aspirin use, %                46.3         44.8         0.61
Family history of premature   8.9          10.9         0.24
  CAD <60 years of age, %

Mean±SE unless otherwise stated.
m.v. indicates matching variable; CAD, coronary artery disease.
Continuous and categorical variables were tested by paired t test and
McNemar test, respectively.

Results 

Baseline characteristics of cases and controls are shown in 
Table 1. As expected, the case participants had a higher 
prevalence of traditional cardiovascular risk factors at base- 
line as compared with controls. The genotype frequencies for 
the polymorphisms tested were in Hardy-Weinberg equilib- 
rium in the control group and in the case group. 
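The Hardy-Weinberg check mentioned above can be illustrated with a minimal sketch. The function name `hwe_chi_square` is hypothetical, and the genotype counts (89, 170, 62) are an assumption: they are back-calculated from the SG13S32 percentages among MI controls in Table 2 under an assumed total of 321 genotyped subjects, since the paper reports percentages rather than counts. The authors used SAS/Genetics; this is only a hand-rolled 1-df goodness-of-fit test.

```python
import math


def hwe_chi_square(n_hom_major, n_het, n_hom_minor):
    """Chi-square goodness-of-fit test (1 df) against Hardy-Weinberg proportions.

    Returns (major allele frequency, chi-square statistic, P value).
    """
    n = n_hom_major + n_het + n_hom_minor
    p = (2 * n_hom_major + n_het) / (2 * n)  # major allele frequency
    q = 1.0 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_hom_major, n_het, n_hom_minor)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return p, chi2, p_value


# Assumed counts (CC, CA, AA), reconstructed from Table 2 percentages;
# not reported by the authors.
major_freq, chi2, p_value = hwe_chi_square(89, 170, 62)
print(f"C allele frequency={major_freq:.2f}, chi2={chi2:.2f}, P={p_value:.2f}")
```

A P value above 0.05 here would be read as no detectable departure from Hardy-Weinberg proportions, which is the sense in which the text describes the genotype frequencies as being in equilibrium.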

Using a single-marker χ² analysis, allele and genotype
distributions were similar between cases and controls 
(Table 2). Results from the adjusted conditional logistic 
regression analysis, assuming additive, dominant, or reces- 
sive mode of inheritance, showed no significant associa- 
tion of the variants tested with the clinical outcomes 
(P>0.07; data not shown). In general, the polymorphisms 
tested were in LD (supplemental Table I, available online 
at http://stroke.ahajournals.org). The overall haplotype 
distributions between cases and controls were similar (MI: 
HapA region, P=0.79, HapB region, P=0.94; ischemic 
stroke: HapA region, P=0.77, HapB region, P=0.26;
supplemental Table II, available online at
http://stroke.ahajournals.org). The most frequent haplotypes
were G-T-G-C, and G-T-A-G for HapA region, and HapB 
region, respectively (supplemental Table II), and thus were 
used as the referents. Results from the adjusted haplotype- 
based conditional logistic regression analysis again 
showed similar null findings (supplemental Table III, 
available online at http://stroke.ahajournals.org). 
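As a rough consistency check on the haplotype results, the odds ratios, 95% CIs, and P values quoted in the abstract can be related to one another through a Wald approximation on the log-odds scale. The sketch below is an illustration only: the function name `p_from_or_and_ci` is hypothetical, and the normal approximation is an assumption rather than the authors' conditional logistic regression, so small differences from the published P values are expected.

```python
import math


def p_from_or_and_ci(odds_ratio, ci_low, ci_high, z_crit=1.959964):
    """Recover the standard error, z statistic and two-sided P value
    implied by an odds ratio and its 95% confidence interval."""
    # On the log scale the CI is symmetric: log(OR) +/- z_crit * SE.
    se = (math.log(ci_high) - math.log(ci_low)) / (2.0 * z_crit)
    z = math.log(odds_ratio) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return se, z, p_value


# (OR, CI low, CI high, reported P) as quoted in the abstract.
REPORTED = {
    "MI, HapA": (1.18, 0.76, 1.85, 0.46),
    "MI, HapB": (0.62, 0.36, 1.07, 0.08),
    "Stroke, HapA": (1.11, 0.65, 1.89, 0.71),
    "Stroke, HapB": (0.82, 0.47, 1.42, 0.47),
}

implied = {}
for label, (odds_ratio, low, high, p_reported) in REPORTED.items():
    se, z, p = p_from_or_and_ci(odds_ratio, low, high)
    implied[label] = p
    print(f"{label}: SE={se:.3f}, z={z:+.2f}, "
          f"implied P={p:.2f} (reported {p_reported:.2f})")
```

Under this approximation the implied P values fall within about 0.02 of the reported ones, which suggests the figures in the abstract are internally consistent.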
TABLE 2. Genotype and Allele Distribution

ALOX5AP       MI         MI                 IsST       IsST
Genotype, %   Controls   Cases      P       Controls   Cases      P

SG13S25                             0.80                          0.29
  GG          81.31      80.56              83.13      79.58
  GA          18.07      19.14              15.64      20.00
  AA           0.62       0.31               1.23       0.42
  Allele                            0.89                          0.47
    G          0.90       0.90               0.91       0.90
    A          0.10       0.10               0.09       0.10

SG13S377                            0.71                          0.35
  GG          75.39      78.09              70.37      75.42
  GA          23.05      20.68              25.93      22.50
  AA           1.56       1.23               3.70       2.08
  Allele                            0.41                          0.15
    G          0.87       0.88               0.83       0.87
    A          0.13       0.12               0.17       0.13

SG13S106                            0.54                          0.20
  GG          50.16      46.60              45.27      45.00
  GA          37.69      41.98              44.86      40.00
  AA          12.15      11.42               9.88      15.00
  Allele                            0.59                          0.38
    G          0.69       0.68               0.68       0.65
    A          0.31       0.32               0.32       0.35

SG13S114                            0.90                          0.96
  TT          47.04      45.37              41.56      42.08
  TA          41.43      42.28              43.62      42.50
  AA          11.53      12.35              14.81      15.42
  Allele                            0.63                          0.99
    T          0.68       0.68               0.63       0.63
    A          0.32       0.32               0.37       0.37

SG13S89                             0.76                          0.80
  GG          89.72      88.89              89.71      89.17
  GA           9.66      10.80               9.47      10.42
  AA           0.62       0.31               0.82       0.42
  Allele                            0.84                          0.96
    G          0.95       0.94               0.94       0.94
    A          0.05       0.06               0.06       0.06

SG13S30                             0.83                          0.38
  GG          58.57      58.95              51.85      57.92
  GT          37.69      36.42              41.15      36.67
  TT           3.74       4.63               7.00       5.42
  Allele                            0.91                          0.17
    G          0.77       0.77               0.72       0.76
    T          0.23       0.23               0.28       0.24

SG13S32                             0.30                          0.32
  CC          27.73      22.84              24.28      20.83
  CA          52.96      54.63              47.33      54.17
  AA          19.31      22.53              28.40      25.00
  Allele                            0.15                          0.99
    C          0.54       0.50               0.48       0.48
    A          0.46       0.50               0.52       0.52

SG13S41                             0.50                          0.89
  AA          82.87      83.02              84.36      85.42
  AG          15.58      16.36              14.40      13.75
  GG           1.56       0.62               1.23       0.83
  Allele                            0.73                          0.68
    A          0.91       0.91               0.92       0.92
    G          0.09       0.09               0.08       0.08

SG13S42                             0.17                          0.36
  AA          28.04      34.88              38.68      35.00
  AG          50.78      45.99              43.62      50.00
  GG          21.18      19.14              17.70      15.00
  Allele                            0.11                          0.88
    A          0.53       0.58               0.60       0.60
    G          0.47       0.42               0.40       0.40

SG13S35                             0.08                          0.50
  GG          81.31      85.80              79.42      83.33
  GA          18.69      13.58              19.75      16.25
  AA           0.00       0.62               0.82       0.42
  Allele                            0.21                          0.26
    G          0.91       0.93               0.89       0.91
    A          0.09       0.07               0.11       0.09

IsST indicates ischemic stroke.
P value for χ² test.



Discussion

The present prospective investigation provides no evidence
for an association of the specific gene variants, nor at-risk
haplotypes of the ALOX5AP gene, previously suggested as
genetic risk determinants, with MI or stroke in a
non-Icelandic white population.

In the initial Icelandic report, 1 a 4-SNP haplotype (HapA) was
found to be associated with a 2× greater risk of MI, and an
almost 2× greater risk of stroke. The same group also reported
an association of a different 4-SNP ALOX5AP haplotype
(HapB) with risk of MI in a British sample population 1 (Table
3). A subsequent report by Helgadottir and coauthors found an
association between HapA and an increased risk of ischemic
stroke (relative risk=1.35; P=0.02), and an over-representation
of HapB (relative risk=1.65; P=0.02) with ischemic stroke in a
Scottish male sample population 2 (Table 3). Recently, Lõhmussaar
and coauthors 3 reported that sequence variants in the
ALOX5AP gene are significantly associated with stroke,
particularly in males, in a Central European sample population. A
nominally significant association with stroke was observed for
SG13S114 (OR=1.24; P=0.017), and SG13S100 (OR=1.26;
P=0.024). However, they found no association of HapA with
stroke risk. 3 More recently, Meschia and coauthors conducted
the first replication study using a North American sample
population, and found no association between ALOX5AP gene
variants and stroke, although MI was not investigated in their
study.

TABLE 3. Summary of ALOX5AP At-Risk-Haplotypes Association Studies

                             HapA                                                  HapB
                             MI                         Stroke                     MI                         Stroke
                             Conf, Casf, R, P           Conf, Casf, R, P           Conf, Casf, R, P           Conf, Casf, R, P
Present study United States  0.14, 0.17, 1.18, 0.46     0.18, 0.15, 1.11, 0.71     0.07, 0.06, 0.62, 0.08     0.08, 0.07, 0.82, 0.47
Iceland 1                    0.10, 0.16, 1.80, <0.0001  0.10, 0.15, 1.67, <0.0001  Not available              *0.07, 0.07, 1.09, ns
United Kingdom 1             0.15, 0.17, ns             Not available              0.04, 0.08, 1.95, 0.00037  Not available
Scotland 2                   Not available              0.14, 0.18, 1.35, 0.02     Not available              0.06, 0.09, 1.65, 0.02
Germany 3                    Not available              0.15, 0.15, ns             Not available              ns (data not shown)
North America 4              Not available              ns (data not shown)        Not available              Not available

Conf indicates haplotype frequency in controls; Casf, haplotype frequency in cases; R, risk estimate; ns, nonsignificance.
HapA=SG13S25G-SG13S114T-SG13S89G-SG13S32A; HapB=SG13S377A-SG13S114A-SG13S41A-SG13S35G.
*Data extracted from reference 2.

Given this situation, a possible explanation for the apparent 
discrepancies is that the observed allele, genotype, and at-risk 
haplotype frequencies for the SNPs examined may differ 
between studies, which could be the result of population/ 
ethnic differences. As previously suggested, 3,4 the ALOX5AP
gene variation may play a substantial role in risk of MI, and 
stroke in Iceland (an isolate population), but a lesser role in 
non-Icelandic populations because of different population LD 
structures. These recent results are consistent with the initial 
report that different at-risk haplotypes were found between 
the Icelandic and British study populations. 1 

As shown in Table 3, not all of the published reports
examined the same set of SNPs, nor did all of the reported 
studies examine the association of ALOX5AP variants with 
MI and stroke simultaneously. Further, not all published 
studies presented information on allele, genotype and at-risk 
haplotype frequencies, LD structure, and risk estimates, thus 
making a direct comparison and informative interpretation 
across studies difficult. 

It has been noted in the initial report 1 that variants of 
ALOX5AP gene are involved in the pathophysiology of MI 
and stroke by increasing the production of leukotriene B4, a 
critical regulator in the 5-lipoxygenase pathway, and a 
proinflammatory agent. Leukotrienes are arachidonic acid 
metabolites, which have been implicated in various inflam- 
matory conditions, including asthma, arthritis, psoriasis, and 
atherosclerosis. 11,12 Notably, a recent article by the same
Icelandic group found that a haplotype (HapK) of the gene
encoding leukotriene A4 hydrolase, a protein in the same
biochemical pathway as ALOX5AP, confers ethnicity-specific
(particularly in blacks) risk of MI. 13

The prospective nature of the PHS study and the use of a 
closed population sampling scheme in which subsequent case 
status was determined solely by the development of disease 
strongly reduce the possibility that our findings are attributable 
to bias or confounding. Our study cohort consists entirely of
white males with distinct socioeconomic status (physicians), so
our data cannot be generalized to other ethnic groups and
women. In our study, we had the ability to detect, based on the
present sample sizes, assuming 80% power, at an α of 0.05, a
risk ratio of >1.54 (MI), and 1.64 (ischemic stroke) if the minor
allele frequency is 0.50, and of >2.26 (MI), and 2.49 (ischemic
stroke) if the minor allele frequency is 0.05 assuming a
univariable-additive mode. Thus, we cannot rule out a modest risk of
cardiovascular disease associated with the
polymorphisms/haplotypes tested. It is important to recognize that association
studies like this one can only examine the possible association 
between phenotype and the tested polymorphisms. Our study 
therefore cannot exclude the possibility that examination of 
different polymorphisms/loci, which would by definition have to 
be in linkage disequilibrium with the ones tested, might obtain 
different results. 

In conclusion, our prospective study found no evidence for
an association of the specific Icelandic ALOX5AP gene
polymorphisms/at-risk haplotypes examined with risk of
atherothrombotic events. If corroborated in other non-Icelandic
prospective studies, our data suggest that ALOX5AP gene
variation is not informative for risk assessment of
atherothrombosis in non-Icelandic populations.

No association of polymorphisms in the gene
encoding 5-lipoxygenase-activating protein and 
myocardial infarction in a large central European 
population 

Werner Koch, PhD 1, Petra Hoppmann, MD 1, Jakob C Mueller, PhD 2, Albert Schömig, MD 1, and Adnan Kastrati, MD 1

Purpose: Haplotypes based on polymorphisms in the gene encoding 5-lipoxygenase-activating protein have been
linked with susceptibility to myocardial infarction in Iceland and the United Kingdom. We sought to replicate these
association findings in a large case-control sample from Germany. Methods: The case group included 3657
patients with myocardial infarction and the control group comprised 1211 individuals with angiographically normal
coronary arteries and without clinical signs or symptoms of myocardial infarction. Nine different polymorphisms
were genotyped with the use of the TaqMan technique. Results: Genotype, allele, and haplotype analyses did not
reveal significant associations between the polymorphisms and myocardial infarction. The negative results
included a four-marker haplotype, termed HapA haplotype (odds ratio = 1.10; 95% confidence interval: 0.96-1.25),
that was previously found to be related to myocardial infarction in a sample from Iceland, and a different
four-marker haplotype, termed HapB haplotype (odds ratio = 0.94; 95% CI: 0.79-1.12), that was previously linked with
myocardial infarction in a sample from the United Kingdom. Nine-marker haplotypes were not significantly associated
with myocardial infarction in multiple logistic regression models adjusted for covariates (P > 0.38). Conclusion: In this
sample from central Europe, specific polymorphisms in the gene for 5-lipoxygenase-activating protein were not
associated with myocardial infarction, a result contrasting with previous positive findings. Genet Med 2007:9(2):123-129.
Key Words: ALOX5AP, 5-lipoxygenase-activating protein (FLAP), genetics, haplotype, myocardial infarction



Specific allelic forms of the gene encoding 5-lipoxygenase-activating
protein (FLAP) have been linked with susceptibility
to myocardial infarction (MI) and stroke. 1-4 These association
findings may reflect a possible relationship of the regulatory
function of FLAP in the inflammatory 5-lipoxygenase pathway
and the important role attributed to inflammatory processes in
atherosclerotic diseases. 5-9 The 5-lipoxygenase cascade leads to
the formation of leukotrienes, which exhibit strong
proinflammatory activities in cardiovascular tissues. 9-11 This pathway is
especially active in arterial walls of patients afflicted with
various lesion stages of atherosclerosis of the aorta and of coronary
and carotid arteries. 10

The gene for FLAP (ALOX5AP) contains five exons
and spans approximately 31 kb in the chromosome 13q12
region. 12,13 Specific single nucleotide polymorphisms (SNPs),
named SG13S100, SG13S106, SG13S114, and a four-marker
haplotype of ALOX5AP, termed HapA haplotype (SG13S25-G,
SG13S114-T, SG13S89-G, SG13S32-A), were found to be
related to MI in a population sample from Iceland. 1 A different
four-marker haplotype of ALOX5AP, termed HapB haplotype
(SG13S377-A, SG13S114-A, SG13S41-A, SG13S35-G), but not
the HapA haplotype, was linked with MI in a sample from the
United Kingdom (UK). 1 No evidence of an association of the
HapA or HapB haplotype with MI was obtained in a sample of
white male physicians from the United States (US). 14

We examined whether the nine different SNPs mentioned 
above, nine-marker haplotypes of these SNPs, and the HapA 
and HapB haplotypes were associated with MI in a German 
population. The sample consisted of 3657 patients with MI and 
1211 control individuals, all of whom were assessed with cor- 
onary angiography. 

METHODS 

Patients and controls 

Participants were recruited from Southern Germany and
examined at Deutsches Herzzentrum München or 1. Medizinische
Klinik rechts der Isar der Technischen Universität München from
1993 to 2002. After catheterization, 5264 individuals were deemed
eligible for inclusion in the MI or control group. Written in- 
formed consent for genetic analysis was obtained from 97.1% 
(n = 5111) of these individuals. In no case was consent with- 
drawn. Blood samples assigned for DNA preparation had been 
collected from 95.2% (n = 4868) of the individuals who agreed to 
participate in the study. These individuals, 3657 patients with MI 
and 1211 controls, constituted the study population. Complete 
genotype data were obtained from all these patients and control 
individuals. The study protocol was approved by the Institutional 
Ethics committee and the reported investigations were in accor- 
dance with the principles of the Declaration of Helsinki. 15 
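The recruitment figures in this paragraph can be cross-checked with a few lines of arithmetic. The sketch below uses only the counts stated in the text; the variable names are illustrative.

```python
# Counts as stated in the "Patients and controls" paragraph.
eligible = 5264          # deemed eligible after catheterization
consented = 5111         # gave written informed consent
sampled = 4868           # had blood collected for DNA preparation
cases, controls = 3657, 1211

consent_rate = round(100 * consented / eligible, 1)   # percent of eligible
sample_rate = round(100 * sampled / consented, 1)     # percent of consenting

print(f"consent: {consent_rate}%  sampled: {sample_rate}%")
print(f"cases + controls = {cases + controls}")
```

The rounded rates reproduce the 97.1% and 95.2% quoted in the text, and the case and control counts sum to the 4868 individuals who make up the study population.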

Definitions 

Individuals were considered disease free and, therefore, eli- 
gible as controls when their coronary arteries were angio- 
graphically normal and when they had no history of MI, no 
symptoms suggestive of MI, no electrocardiographic signs of 
MI, and no regional wall motion abnormalities. Coronary an- 
giography in the control individuals was performed for the 
evaluation of chest pain. The diagnosis of MI was established in 
the presence of chest pain lasting longer than 20 minutes com- 
bined with ST-segment elevation or pathologic Q waves on a 
surface electrocardiogram. Patients with MI had to show either 



an angiographically occluded infarct-related artery or regional 
wall motion abnormalities corresponding to the electrocardio-
graphic infarct localization, or both. Systemic arterial hyper-
tension was defined as a systolic blood pressure of ≥140 mm
Hg and/or a diastolic blood pressure of ≥90 mm Hg, 16 on at
least two separate occasions or antihypertensive treatment.
Hypercholesterolemia was defined as a documented total cho-
lesterol value ≥240 mg/dL (>6.2 mmol/L) or current treat-
ment with cholesterol-lowering medication. Persons reporting
regular smoking in the previous 6 months were considered as 
current smokers. Diabetes mellitus was defined as the presence 
of an active treatment with insulin or an oral antidiabetic 
agent; for patients on dietary treatment, documentation of an 
abnormal fasting blood glucose or glucose tolerance test based 
on the World Health Organization criteria 17 was required for 
establishing this diagnosis. 

Genetic analysis 

Genomic DNA was extracted from peripheral blood leuko- 
cytes with the QIAamp DNA Blood Kit (Qiagen, Hilden, Ger- 
many) or the High Pure PCR Template Preparation Kit (Roche 
Applied Science, Mannheim, Germany). We designed and 
used TaqMan allelic discrimination assays for genotype analy- 
sis of nine SNPs in ALOX5AP (Table 1). Primers and probes 
(Table 1 ) were synthesized by Applied Biosystems (Darmstadt, 



Table 1

deCODE    NCBI        SNP    Position in  Location              Primer (5'→3')                 Probe (5'→3') d
SNP ID a  dbSNP ID b  bases  AL512642 c
SG13S25   -           G>A    26663        Upstream of exon 1    TCTGACAGCATCAGCTAGTCTCTTTC     FAM-CACTGTTGCCCAGTGG
                                                                AAATTCATGTTGCTGTGTCCATACA      VIC-AGCCACTGTTACCCAGT
SG13S377  -           G>A    31075        Upstream of exon 1    TTTGGCCAGACTGTCTTGAACTC        FAM-CCTGCCTCGGCCT
                                                                TGGCTCATGCCTATAATCACAAAA       VIC-CTGCCTCAGCCTC
SG13S100  rs4073259   A>G    33381        Upstream of exon 1    GGTGAAGTGGACTCCCTCCAT          FAM-AGCCAGCGCGCAG
                                                                CCCCGCTCTGAGCTCCTT             VIC-CAGCCAGTGCGCAG
SG13S106  rs9579646   G>A    37689        Intron 1              TGTGTAGAGCTGTCTTCCTAAAGTTCTG   FAM-AGTTAGGGCTGCCTC
                                                                AAGCCACTGGAGATAGTTATGAAAGTG    VIC-AGTTAGGACTGCCTCAG
SG13S114  rs10507391  T>A    39206        Intron 1              CCAGATGTATGTCCAAGCCTCTCT       FAM-TGCAATTCTAATTAACCTC
                                                                CTCTGTAAGGTAGGTCTATGGTTGCAA    VIC-TGCAATTCTATTTAACCTC
SG13S89   rs4769874   G>A    53551        Intron 3              TCGGGAGGCCGTGTTTC              FAM-ATTATCACACGCGCTCT
                                                                CCAGGGAGCAAGCATTAGCA           VIC-TATCACATGCGCTCTG
SG13S32   rs9551963   A>C    59657        Intron 4              CTGCTTTAGTTCTTGACCTCACCAA      FAM-AAGGATCTCATCTAGCAAT
                                                                CTGGGGTTCAAGAGAGAAATTCC        VIC-AAGGATCTCATCGAGCAA
SG13S41   rs9315050   A>G    63155        Intron 4              CCTGTCTCCAAATACAGTCCCATT       FAM-ATCTTTACTCTCAGTTCCT
                                                                AGGTCCCTTCCAAAATTCATATGTT      VIC-TCTTTACCCTCAGTTCC
SG13S35   -           G>A    67227        Downstream of exon 5  CCTGGCATTGAGGAGTTTTCC          FAM-TAAAAAACCGAAAGGAC
                                                                ACCCCACAAATACCTACAAATATGTGTAT  VIC-TTAAAAAACTGAAAGGACC

a Helgadottir et al. 1
b NCBI SNP database (http://www.ncbi.nlm.nih.gov/entrez/); last accessed September 27, 2006.
c NCBI nucleotide database (http://www.ncbi.nlm.nih.gov/entrez/); sequence version of May 18, 2005.
d FAM (6-carboxy-fluorescein) or VIC (proprietary dye of Applied Biosystems) was attached to the 5' ends of the probe oligonucleotides. The sequences of the probes
used for analysis of the SG13S25, SG13S377, SG13S106, and SG13S114 SNPs corresponded to the coding strand and the sequences of the probes used for analysis of
the SG13S100, SG13S89, SG13S32, SG13S41, and SG13S35 SNPs corresponded to the noncoding strand; the allele-specific nucleotide in each probe sequence is
underlined.

SNPs, single nucleotide polymorphisms.
Table 2

Baseline characteristics of the control group and the MI group

Characteristic             Control group   MI group     P
                           (n = 1211)      (n = 3657)
Age, yr                    60.3 ± 11.9     64.0 ± 12.0  <0.0001
Women                      598 (49.4)      885 (24.2)   <0.0001
Arterial hypertension      589 (48.6)      2246 (61.4)  <0.0001
Hypercholesterolemia       602 (49.7)      2067 (56.5)  <0.0001
Current cigarette smoking  184 (15.2)      1849 (50.6)  <0.0001
Diabetes mellitus          65 (5.4)        754 (20.6)   <0.0001

Age is mean ± SD; other variables are presented as number (%).
MI, myocardial infarction.



Germany). To accomplish allele-specific signaling, the probes
contained the fluorogenic dyes 6-carboxy-fluorescein (FAM)
or VIC (proprietary dye of Applied Biosystems). Minor groove
binder groups were conjugated with the 3' ends of the oligo-
nucleotides to facilitate formation of stable duplexes between
the probes and their single-stranded DNA targets. 18 Approxi-
mately 20% of the DNA samples were retyped with each Taq-
Man system to control for correct sample handling and data
acquisition. The results of these repeat assays were in full agree-
ment with the original genotyping results.

Analyses of PCR products with allele-discriminating restric- 
tion enzymes and/or DNA sequencing were used to verify the 
accuracy of TaqMan genotyping. We employed the restriction 
enzymes BglI (SG13S25 SNP), HaeIII (SG13S377 SNP), NsbI
(SG13S100 SNP), SatI (SG13S106 SNP), TasI (SG13S114



Table 3

Genotype distributions and allele frequencies of ALOX5AP SNPs in the control group and the MI group

deCODE      Genotype  Control group     MI group          P      Allele  Control group   MI group        P
SNP ID                (1211 genotypes)  (3657 genotypes)                 (2422 alleles)  (7314 alleles)
SG13S25     GG        963 (79.5)        2949 (80.6)       0.54   G       2162 (89.3)     6579 (90.0)     0.33
            GA        236 (19.5)        681 (18.6)               A       260 (10.7)      735 (10.0)
            AA        12 (1.0)          27 (0.7)
SG13S377    GG        861 (71.1)        2714 (74.2)       0.053  G       2047 (84.5)     6285 (85.9)     0.086
            GA        325 (26.8)        857 (23.4)               A       375 (15.5)      1029 (14.1)
            AA        25 (2.1)          86 (2.4)
SG13S100    AA        494 (40.8)        1461 (40.0)       0.11   A       1521 (62.8)     4636 (63.4)     0.60
            AG        533 (44.0)        1714 (46.9)              G       901 (37.2)      2678 (36.6)
            GG        184 (15.2)        482 (13.2)
SG13S106    GG        568 (46.9)        1697 (46.4)       0.27   G       1644 (67.9)     4998 (68.3)     0.68
            GA        508 (41.9)        1604 (43.9)              A       778 (32.1)      2316 (31.7)
            AA        135 (11.1)        356 (9.7)
SG13S114    TT        526 (43.4)        1591 (43.5)       0.40   T       1586 (65.5)     4842 (66.2)     0.52
            TA        534 (44.1)        1660 (45.4)              A       836 (34.5)      2472 (33.8)
            AA        151 (12.5)        406 (11.1)
SG13S89     GG        1093 (90.3)       3332 (91.1)       0.60   G       2301 (95.0)     6983 (95.5)     0.34
            GA        115 (9.5)         319 (8.7)                A       121 (5.0)       331 (4.5)
            AA        3 (0.2)           6 (0.2)
SG13S32     AA        301 (24.9)        924 (25.3)        0.39   A       1224 (50.5)     3650 (49.9)     0.59
            AC        622 (51.4)        1802 (49.3)              C       1198 (49.5)     3664 (50.1)
            CC        288 (23.8)        931 (25.5)
SG13S41     AA        1047 (86.5)       3166 (86.6)       0.96   A       2253 (93.0)     6810 (93.1)     0.88
            AG        159 (13.1)        478 (13.1)               G       169 (7.0)       504 (6.9)
            GG        5 (0.4)           13 (0.4)
SG13S35     GG        977 (80.7)        3025 (82.7)       0.24   G       2172 (89.7)     6645 (90.9)     0.086
            GA        218 (18.0)        595 (16.3)               A       250 (10.3)      669 (9.1)
            AA        16 (1.3)          37 (1.0)

Variables are presented as number (%) of genotypes or alleles in control individuals and myocardial
infarction patients.
SNPs, single nucleotide polymorphisms.



February 2007 • Vol. 9 ■ No. 2 






Koch et al. 



SNP), XceI (SG13S89 SNP), TaqI (SG13S32 SNP), and BslLI
(SG13S41 SNP) (MBI Fermentas). DNA sequencing was used 
to test whether one or more additional polymorphisms were 
present in the probe-binding section of the amplicons, because 
they may interfere with TaqMan reactions and result in wrong 
genotype assignments. With each SNP, 100 DNA samples were 
examined by sequencing. The known SNPs were identified as 
the only sequence variabilities in the probe-binding regions. 
Thus the probability of genotyping errors due to possible fur- 
ther sequence variations was relatively low. 

Clinicians responsible for diagnosis were not aware of the 
genetic data. All genetic analyses were blinded. 



Statistical analysis 

The analysis consisted of comparisons of genotype, allele, and 
haplotype frequencies between the control group and the group 
of patients with MI. Because stronger associations of the HapA
haplotype with MI were observed in men compared to women in
both the Iceland and UK studies, 1 we also conducted separate
analyses of SNP genotype distributions and HapA and HapB hap-
lotype frequencies in the groups of men and women. Discrete
variables are expressed as counts (%) and compared using the χ²
test. Continuous variables are expressed as mean ± SD and com-
pared by means of the unpaired, two-sided t test. Haplotypes were
reconstructed from genotype data with the use of the software 



Table 4

Genotype distributions of ALOX5AP SNPs in the women and men of the control group and the MI group

                      Women                               Men
deCODE      Genotype  Control group  MI group      P      Control group  MI group       P
SNP ID                (n = 598)      (n = 885)            (n = 613)      (n = 2772)
SG13S25     GG        465 (77.8%)    716 (80.9%)   0.13   498 (81.2%)    2233 (80.6%)   0.87
            GA        127 (21.2%)    166 (18.8%)          109 (17.8%)    515 (18.6%)
            AA        6 (1.0%)       3 (0.3%)             6 (1.0%)       24 (0.9%)
SG13S377    GG        429 (71.7%)    659 (74.5%)   0.37   432 (70.5%)    2055 (74.1%)   0.13
            GA        157 (26.3%)    205 (23.2%)          168 (27.4%)    652 (23.5%)
            AA        12 (2.0%)      21 (2.4%)            13 (2.1%)      65 (2.3%)
SG13S100    AA        255 (42.6%)    372 (42.0%)   0.64   239 (39.0%)    1089 (39.3%)   0.10
            AG        261 (43.6%)    404 (45.6%)          272 (44.4%)    1310 (47.3%)
            GG        82 (13.7%)     109 (12.3%)          102 (16.6%)    373 (13.5%)
SG13S106    GG        291 (48.7%)    431 (48.7%)   0.66   277 (45.2%)    1266 (45.7%)   0.27
            GA        247 (41.3%)    377 (42.6%)          261 (42.6%)    1227 (44.3%)
            AA        60 (10.0%)     77 (8.7%)            75 (12.2%)     279 (10.1%)
SG13S114    TT        270 (45.2%)    403 (45.5%)   0.65   256 (41.8%)    1188 (42.9%)   0.53
            TA        256 (42.8%)    389 (44.0%)          278 (45.4%)    1271 (45.9%)
            AA        72 (12.0%)     93 (10.5%)           79 (12.9%)     313 (11.3%)
SG13S89     GG        545 (91.1%)    813 (91.9%)   0.84   548 (89.4%)    2519 (90.9%)   0.37
            GA        52 (8.7%)      70 (7.9%)            63 (10.3%)     249 (9.0%)
            AA        1 (0.2%)       2 (0.2%)             2 (0.3%)       4 (0.1%)
SG13S32     AA        147 (24.6%)    221 (25.0%)   0.51   154 (25.1%)    703 (25.4%)    0.89
            AC        315 (52.7%)    442 (49.9%)          307 (50.1%)    1360 (49.1%)
            CC        136 (22.7%)    222 (25.1%)          152 (24.8%)    709 (25.6%)
SG13S41     AA        524 (87.6%)    773 (87.3%)   0.99   523 (85.3%)    2393 (86.3%)   0.75
            AG        72 (12.0%)     109 (12.3%)          87 (14.2%)     369 (13.3%)
            GG        2 (0.3%)       3 (0.3%)             3 (0.5%)       10 (0.4%)
SG13S35     GG        472 (78.9%)    722 (81.6%)   0.19   505 (82.4%)    2303 (83.1%)   0.59
            GA        114 (19.1%)    154 (17.4%)          104 (17.0%)    441 (15.9%)
            AA        12 (2.0%)      9 (1.0%)             4 (0.7%)       28 (1.0%)

Variables are presented as number (%) of genotypes in control individuals and MI patients.
SNPs, single nucleotide polymorphisms; MI, myocardial infarction.
package PHASE. 19 We tested for the independent association ef-
fect of nine-marker haplotypes in multiple logistic regression
models of MI that included as covariates age, gender, history of 
arterial hypertension, history of hypercholesterolemia, current 
cigarette smoking, and diabetes mellitus. Adjusted odds ratios and 
95% Wald confidence intervals were calculated based on these 
models. Statistical significance was set at P < 0.05. 
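
The χ² comparison of discrete variables described above can be sketched for a single 2 × 2 table. The following is a minimal illustration in Python, not the authors' analysis code: the function name `chi_square_2x2` is hypothetical, and the absence of a continuity correction is an assumption, since the text does not state which variant of the test was used. The example counts are the SG13S25 allele counts listed in Table 3.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns the statistic and its P value for 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# SG13S25 allele counts from Table 3: G and A in controls, then G and A in MI patients.
chi2, p = chi_square_2x2(2162, 260, 6579, 735)
print(f"chi-square = {chi2:.3f}, P = {p:.2f}")
```

Under these assumptions the P value comes out close to the 0.33 reported for the SG13S25 allele comparison in Table 3.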

RESULTS 

The main baseline characteristics of the control group (n = 1211)
and the group of patients with MI (n = 3657) are shown in Table 2. 
Mean age of the MI patients was higher than that of the control 
group; the proportion of women was lower in the patient group 
than in the control group; and history of arterial hypertension and 
hypercholesterolemia, current cigarette smoking, and diabetes 
mellitus were more prevalent in the MI patient group than in the 
control group (P < 0.0001 for all comparisons; Table 2). 

Genotype distributions and allele frequencies of the ALOX5AP 
SNPs were not significantly different between the control group 
and patient group (Table 3). Significant sex-related differences of 
the genotype distributions were not found (Table 4). 

Figure 1 shows the linkage disequilibrium (LD) block struc- 
ture defined by the nine genotyped SNPs. Strong LD was 



Fig. 1. Genetic diversity at the ALOX5AP genomic region located in the long arm of
chromosome 13 (band q12). The exon-intron structure was adapted from sequence data
deposited in the NCBI nucleotide database (http://www.ncbi.nlm.nih.gov/entrez/) under
accession number AL512642, version of May 18, 2005. The values within squares show the
pairwise correlations between single nucleotide polymorphisms (SNPs) (measured as D')
defined at the top left and top right sides of the squares. Squares without a number
indicate D' = 1.00. SNP designations: AP1 = SG13S25, AP2 = SG13S377, AP3 =
SG13S100, AP4 = SG13S106, AP5 = SG13S114, AP6 = SG13S89, AP7 = SG13S32, AP8 =
SG13S41, AP9 = SG13S35.

Table 5

Haplotype  Control group      MI group           P
           (2422 haplotypes)  (7314 haplotypes)
HapA       359 (14.8)         1171 (16.0)        0.16
HapB       182 (7.5)          518 (7.1)          0.48

Haplotype frequencies are presented as number (%). The HapA haplotype is
defined by the alleles SG13S25-G, SG13S114-T, SG13S89-G, and SG13S32-A,
and the HapB haplotype is defined by the alleles SG13S377-A, SG13S114-A,
SG13S41-A, and SG13S35-G (Helgadottir et al. 1 ).
MI, myocardial infarction.



present across the ALOX5AP region (Fig. 1). Frequencies of 
the HapA (SG13S25-G, SG13S114-T, SG13S89-G, SG13S32-A) 
and HapB (SG13S377-A, SG13S114-A, SG13S41-A, SG13S35-G) 
haplotypes were not substantially different between the control 
group and the patient group (Table 5). Risk estimates were 1.10 
(95% CI: 0.96-1.25) for the HapA haplotype and 0.94 (95% 
CI: 0.79-1.12) for the HapB haplotype. Haplotypes defined by
nine SNPs were not present at significantly different propor- 
tions among the control individuals and patients, with the ex- 
ception of the Hap5 haplotype, which showed a moderately 
higher frequency in the control group than in the patient group 
(Table 6). 
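
The risk estimates quoted above can be approximated directly from the haplotype counts in Table 5. The sketch below, in Python, assumes that the reported values are unadjusted odds ratios with Wald-type 95% confidence intervals computed on haplotype counts; the text does not say so explicitly, and the function name `odds_ratio_wald` is hypothetical.

```python
import math


def odds_ratio_wald(case_with, case_total, ctrl_with, ctrl_total, z=1.96):
    """Odds ratio and Wald confidence interval from carrier counts and totals."""
    a, b = case_with, case_total - case_with   # cases: with / without the haplotype
    c, d = ctrl_with, ctrl_total - ctrl_with   # controls: with / without the haplotype
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) from the four cell counts.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Haplotype counts from Table 5 (MI group first, then control group).
hap_a = odds_ratio_wald(1171, 7314, 359, 2422)
hap_b = odds_ratio_wald(518, 7314, 182, 2422)
for name, (or_, lo, hi) in (("HapA", hap_a), ("HapB", hap_b)):
    print(f"{name}: OR = {or_:.2f} (95% CI: {lo:.2f}-{hi:.2f})")
```

Under these assumptions the output agrees, after rounding, with the estimates given in the text for both haplotypes.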

The frequencies of the HapA, HapB, and nine-marker hap- 
lotypes were not significantly different between the women of 
the control and MI groups and between the men of the two 
groups. In addition, we did not observe significant differences 
in age or sex between the carriers and noncarriers of specific 
haplotypes in the control group. 

To assess whether independent associations existed between 
nine-marker haplotypes and MI, we performed a multivariate 
logistic regression analysis. After adjustments were made for 
conventional cardiovascular risk markers (age, gender, history 
of arterial hypertension, history of hypercholesterolemia, cur-
rent cigarette smoking, diabetes mellitus), the analysis showed
that none of the five most frequent nine-marker haplotypes,
including the Hap5 haplotype, or the combined other haplo-
types were significantly related with MI (P ≥ 0.38).

Table 6

Frequencies of nine-marker haplotypes in the control and MI groups

Haplotype                     Control group      MI group           P
Name   Allele combination     (2422 haplotypes)  (7314 haplotypes)
Hap1   GGAGTGCAG              928 (38.3)         2805 (38.4)        0.98
Hap2   GGAGTGAAG              237 (9.8)          808 (11.0)         0.082
Hap3   AGAGTGAAG              253 (10.4)         705 (9.6)          0.25
Hap4   GGGAAGAAG              204 (8.4)          665 (9.1)          0.32
Hap5   GAGAAGAAA              188 (7.8)          479 (6.5)          0.041
       Other                  612 (25.3)         1852 (25.3)        0.96

Haplotype frequencies are presented as number (%). Shown are results ob-
tained from the five most frequent nine-marker haplotypes and the combined
other nine-marker haplotypes. Each haplotype is defined as a specific allele
combination based on nine single nucleotide polymorphisms (SNPs) in
ALOX5AP. The order of the SNPs is as follows (from left to right): SG13S25,
SG13S377, SG13S100, SG13S106, SG13S114, SG13S89, SG13S32, SG13S41,
SG13S35. Overall P = 0.12. See Table 1 and Figure 1 for the locations of the
SNPs in the ALOX5AP genomic region.
MI, myocardial infarction.

DISCUSSION 

The present data show that specific SNPs in ALOX5AP are 
not associated with MI in a large German population. Analyses 
in women and men did not reveal sex-specific relationships 
between the SNPs and MI. Specific four-marker haplotypes, 
the HapA and HapB haplotypes, and nine-marker haplotypes 
were not associated with MI. Most SNPs were in strong LD and 
the LD block structure was similar to those in other white 
populations. 4,14 Three of the nine SNPs examined here, the
SG13S100, SG13S106, and SG13S114, were found to be signif-
icantly associated with MI in a population from Iceland that
consisted of 779 unrelated patients with MI and 624 popula- 
tion-based control individuals. 1 However, significant associa- 
tions between these SNPs and MI were not observed when 
adjustments were made for the number of markers tested. 1 
None of the three SNPs was associated with MI in the present 
population. The HapA haplotype was associated with a two- 
fold greater risk of MI in the Icelandic population (nominal 
P = 0.0000023; adjusted P = 0.005) but not in a sample from 
the UK (753 patients with MI and 730 control individuals). 1 In 
the same UK population, the HapB haplotype was associated 
with MI (nominal P = 0.00037; adjusted P = 0.046). 1 

The control subjects had some indication for coronary an- 
giography, and, therefore, they did not constitute a typical 
sample of healthy controls. We compared the frequencies of 
the HapA haplotype and the SNP alleles that define the HapA 
haplotype between the present control sample and an indepen- 
dent control group that consisted of 736 unrelated individuals 
from the KORAS2000 sample, a representative local popula- 
tion sample from southern Germany. 4 In the present control 
group and the control group from the KORAS2000 sample, 4 
the frequencies of the HapA haplotype were 14.8% versus 
15.2% (P = 0.74) and the frequencies of the SG13S25-G, 
SG13S114-T, SG13S89-G, and SG13S32-A alleles were 89.3% 
vs. 90.1% (P = 0.42), 65.5% vs. 65.0% (P = 0.77), 95.0% vs. 
96.0% (P = 0.15), and 50.5% vs. 49.7% (P = 0.62), respec-
tively. Thus, with regard to the frequencies of the HapA hap-
lotype and the alleles that constitute the HapA haplotype, the
present control group is not substantially different from an 
established population-based sample. We inferred from this 
finding that the control sample with coronary angiography was 
suitable for the genetic association study described here. Mea- 
sures of inflammation were not examined, which is a limita- 
tion of the current study. 

Relationships of ALOX5AP SNPs and haplotypes with MI 
and ischemic stroke were evaluated in a nested case-control 
study within the Physicians' Health Study cohort that com-
prised predominantly white (>94%) male US physicians. 14,20
Investigation of 341 MI case-control pairs did not provide ev- 
idence of an association of any of the tested SNPs or the HapA 
or HapB haplotype with MI. 14 Genotype distributions and fre- 
quencies of SNP alleles and the HapA and HapB haplotypes in 
the case and control groups of the US sample 14 corresponded 
well with those of the present German sample. 

Similar to results obtained in Germans (this study) and US 
physicians, 14 the SNPs that define the HapA and HapB haplo- 
types were not associated with MI in a Japanese population 
that included 353 patients with MI and 1875 control 
individuals. 2 A meaningful association analysis of the HapA 
and HapB haplotypes was not possible in the sample from 
Japan because, with some of the SNPs, minor alleles were either 
absent or extremely rare. 2 Two-marker ALOX5AP haplotypes 
not related to the HapA and HapB haplotypes were associated 
with MI in the Japanese sample. 2 

Studies conducted with samples of white individuals pro- 
vided heterogeneous results about the relationship of the 
HapA and HapB haplotypes with MI (Table 7). Association of 
the HapA haplotype with MI was observed in a study sample 
from Iceland, but this finding was not replicated in samples 
from Germany (present study), the UK, and the US. 1,14 A rela-
tionship of the HapB haplotype with MI was found in a study
sample from the UK, but this result was not confirmed in sam-
ples from Germany (present study) and the US. 1,14 Heteroge-
neities of genetic and environmental factors across the source
populations are unlikely to account for the inconsistencies. 
Genetic markers for proposed gene-disease associations may vary 
in frequency between populations, but there is empirical evidence 
that their biological impact on the risk of common diseases is 



Frequencies of theHapA and Hap B haplotypes and estimated risks of 

^ . population samples . . 



HapA 



HapB 



Study 



Controls/cases 



Risk 



Controls/cases 



Risk 



Germany (present) 
United States 14 
Iceland 1 

United Kingdom 1 



0.15/0.16 
0.14/0.17 
0.10/0.16 
0.15/0.17 



1.10 
1.18 
1.80 
n.s. 



0.16 
0.46 
<0.005 fl 



0.08/0.07 
0.07/0.06 

0.04/0.08 



0.94 
0.62 

No data available 
1.95 



Haplotype frequencies are presented as proportions of controls and cases; n.s. not significant (data not shown).' 
"Adjusted for the number of haplotypes tested. 1 



0.48 
0.08 

0.046° 
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usually consistent even across ethnic boundaries. 21 Consistent 
replication of genetic associations has been difficult to achieve, 
despite the biological plausibility of these associations. 22 In this 
context, the present findings argue against association of defined 
SNPs and haplotypes of ALOX5AP with MI. 
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* a 

Nonvalidation of Reported Genetic Risk 
Factors for Acute Coronary Syndrome 
in a Large-Scale Replication Study 



Context Given the numerous, yet inconsistent, reports of genetic variants being as- 
sociated with acute coronary syndromes (ACS), there is a need for comprehensive vali- 
dation of ACS susceptibility genotypes. 

Objective To perform an extensive validation of putative genetic risk factors for ACS. 
Design, Setting, and Participants Through a systematic literature search of articles 
published before March 10, 2005, we identified genetic variants previously reported as 
significant susceptibility factors for atherosclerosis or ACS. Restricting our analysis to white 
patients to reduce confounding from racial admixture, we identified 811 patients who
presented from March 2001 through June 2003 with ACS at 2 Kansas City, Mo, university- 
affiliated hospitals. During 2005-2006, we genotyped the 811 patients along with 650 
age- and sex-matched controls for 85 variants in 70 genes and attempted to replicate 
previously reported associations. We further explored possible associations without prior 
assumption of specific risk models and used the Sign test to search for weak associations. 

Main Outcome Measures Compare each prespecified gene variant associated with
ACS risk among cases and controls. A surplus of associations would imply that some 
are associated with ACS. 

Results Of 85 variants tested, only 1 putative risk genotype (-455 promoter variant 
in β-fibrinogen) was nominally statistically significant (P=.03). Only 4 additional genes
were positive in model-free analysis. Neither number of associations was more fre-
quent than expected by chance, given the number of comparisons. Finally, only 41 of
84 predefined risk variants were even marginally more frequent in cases than in con-
trols (with 1 tie), representing a 48.8% "win rate" (95% confidence interval, 38.1%-
59.5%) for the collective risk genotypes (P=.91, Sign test).

Conclusions Our null results provide no support for the hypothesis that any of the
85 genetic variants tested is a susceptibility factor for ACS. These results emphasize
the need for robust replication of putative genetic risk factors before their introduc- 
tion into clinical care. 

JAMA. 2007;297:1551-1561 www.jama.com
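
The "win rate" summary in the Results can be illustrated with a short sign-test calculation. This is a sketch in Python under stated assumptions rather than the authors' procedure: it treats the 41 wins as successes out of all 84 comparisons (so the single tie counts as a non-win), uses an exact two-sided binomial test with a 50% null probability, and uses a normal-approximation (Wald) interval for the proportion. The function names are hypothetical.

```python
import math


def sign_test_two_sided(wins, n):
    """Exact two-sided sign test P value under a 50% win probability."""
    def pmf(k):
        return math.comb(n, k) / 2 ** n

    lower_tail = sum(pmf(k) for k in range(0, wins + 1))
    upper_tail = sum(pmf(k) for k in range(wins, n + 1))
    # Double the smaller tail and cap the result at 1.
    return min(1.0, 2 * min(lower_tail, upper_tail))


def wald_proportion_ci(wins, n, z=1.96):
    """Observed proportion with a normal-approximation confidence interval."""
    rate = wins / n
    half_width = z * math.sqrt(rate * (1 - rate) / n)
    return rate, rate - half_width, rate + half_width


win_rate, ci_low, ci_high = wald_proportion_ci(41, 84)
p_value = sign_test_two_sided(41, 84)
print(f"win rate = {win_rate:.1%} (95% CI, {ci_low:.1%}-{ci_high:.1%}), P = {p_value:.2f}")
```

With these assumptions the result is consistent with the reported 48.8% win rate, its confidence interval, and P=.91; a different handling of the tie would change the P value.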
Compelling evidence from twin and epidemiological studies suggests a genetic
basis for atherosclerotic heart disease and acute coronary syndromes (ACS),
including unstable angina, non-ST-elevation myocardial infarction (NSTEMI),
and ST-elevation myocardial infarction (STEMI). 1,2 To date, numerous
candidate genes have been implicated, mainly by case-control studies, as
potential cardiovascular risk factors, but few, if any, have been established
definitively. 3,5 Factors undermining the validity of previous reports include
inappropriately small sample sizes, multiple subgroup comparisons, and
publication bias. 4

Before use in clinical care, potential 
genetic risk factors would ideally be 
replicated en masse in large, well- 
characterized patient populations. 6 To 
date, no such comprehensive valida- 
tion of genetic variants potentially as- 
sociated with ACS or atherosclerosis has 
been reported. 

Accordingly, we first sought to 
identify genetic associations with ACS 
by systematically searching the medi- 
cal literature for variants reported in 
association with MI, unstable angina,
or atherosclerosis. We then attempted 
to validate these putative genetic risks 
in a large case-control study. 



METHODS 
Candidate Genes 

We searched PubMed and bibliogra- 
phies of original and review articles 
for manuscripts published before 
March 10, 2005, that reported statisti- 
cally significant associations between 
specific genotypes and coronary ath- 
erosclerosis or ACS (A list of the 
articles is available on request from 
the authors). MEDLINE search terms 
GENETIC RISK FACTORS FOR ACUTE CORONARY SYNDROMES



included: gene, genetic, polymorphism, 
myocardial infarction, atherosclerosis, 
coronary heart disease, and coronary 
artery disease. Reports were included 
if they contained a claim of a signifi- 
cant positive association, with an 
investigator-reported P value <.05. A 
total of 96 polymorphic genetic vari- 
ants in 75 genes were identified and 
included (Table 1 and Table 2). 
Eleven of those were excluded because 
they had failed the multiplex genotyp- 
ing assay. 

Description of Cases and Controls 

Eight hundred eleven white patients of 
European ancestry with ACS were 
identified from a consecutive series of 
patients presenting at 2 Kansas City, 
Mo, hospitals (Mid-America Heart 
Institute and Truman Medical Cen-
ter), from March 2001 through June
2003. Standard definitions were used 
to diagnose ACS patients with either 
MI or unstable angina. 92,93 Myocardial 
infarction was defined by a positive 
troponin blood test in the setting of 
symptoms and electrocardiogram 
changes (both ST-segment elevation 
and non-ST-segment elevation 
changes) consistent with MI. Unstable
angina diagnoses were confirmed, by 
concurrence of 3 physician chart 
reviewers, if patients had negative tro- 
ponin blood tests and any one of the 
following: new onset angina (<2 
months) of at least Canadian Cardio-
vascular Society Classification class III,
prolonged (>20 minutes) rest angina, 
recent (<2 months) worsening of 
angina, or angina that occurred within 
2 weeks of an MI. 93 Of the troponin-
negative unstable angina patients, 203
(92.7%) had a cardiac catheterization, 
a nuclear stress test, or a stress echo- 
cardiogram to corroborate their 
diagnoses. 

Each participating inpatient with 
ACS was interviewed to determine vari- 
ables, such as smoking, alcohol use, 
family history (>1 first-degree rela- 
tives with MI or coronary artery dis- 
ease), and to obtain consent for a blood 
sample for genetic analysis. In addi- 
tion, detailed chart abstractions were 



performed to collect relevant labora- 
tory and clinical data. 

A total of 1045 ACS patients (of 
which 811 white patients were included
in the current study) agreed to partici- 
pate and to provide a blood sample for 
genetic analysis. Patients self-reported 
their race/ethnicity by selecting one of 
the following descriptors that were pro- 
vided by the investigators: white, white 
Hispanic, African American, and Afri- 
can American non-Hispanic. Age- and 
sex-matched controls were recruited 
from the ambulatory outpatient clini- 
cal laboratory of 1 of the centers, Saint 
Luke's Hospital of Kansas City. These 
patients were undergoing routine labo- 
ratory testing and were asked to com- 
plete a medical questionnaire defining 
cardiac risk factors and medical co- 
morbidities. Those controls reporting 
a previous ACS, prior coronary artery 
bypass graft surgery or prior percuta- 
neous coronary intervention were 
excluded. To minimize the potential 
impact of genetic admixture, 650 white 
controls of mixed European ancestry 
who reported no history of coronary 
artery disease were selected from among 
the 1054 potential controls. Risk fac- 
tor data were missing for 9 sex-, age-, 
and race-matched unaffected con- 
trols, and 56 additional matched 
controls were used for ALOX5AP 
haplotyping. 

The research protocol was ap- 
proved by the institutional review 
boards of both institutions; all study 
participants provided written in- 
formed consent for clinical and ge- 
netic studies. 

Genotyping 

Genomic DNA was isolated (Gentra 
PUREGENE, Minneapolis, Minn) from 
blood samples and subjected to 
whole genome amplification by mul- 
tiple-strand displacement (Molecular 
Staging Inc, New Haven, Conn), using
random priming and Phi-29 polymer- 
ase. 94,95 Genotyping was performed 
using the Sequenom MALDI-TOF
(Matrix Assisted Laser Desorption-
Ionization Time-of-Flight) system,
using Spectrodesign software for as- 



say design (Sequenom, San Diego, 
Calif), and assay methods that have pre- 
viously been described. 96,97 Gene vari- 
ants were excluded from analysis if they 
could not be genotyped using the Se- 
quenom system due to persistent as- 
say failure, defined as less than 95% 
scorable genotypes after 4 multiplex re- 
action design cycles. Eleven assays were 
ultimately excluded.* For the rare 
MEF2A 21-base pair (bp) deletion, 
cases and controls were genotyped by 
polymerase chain reaction to generate 
amplicons of 152-bp nondeletion or 
131-bp deletion followed by electro- 
phoresis on 3% agarose gels. Identi- 
fied deletions were confirmed by di- 
rect DNA sequencing. Due to its rarity, 
MEF2A was analyzed separately, and
thus only the other 84 genes were sub-
jected to the full set of statistical analy-
ses. PHASE Version 2.1 was used to es-
timate haplotype frequencies for
ALOX5AP. 101,102

Statistical Analysis 

Genotype distributions in cases and
controls were examined for significant
deviation (P<.05) from Hardy-Weinberg
equilibrium. The number of departures
was assessed by Monte Carlo simulation
and compared with the number expected
by chance alone (Resampling Stats Inc,
College Park, Md).
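
The Hardy-Weinberg check described above can be illustrated with a short sketch. It uses the asymptotic 1-df chi-square goodness-of-fit test rather than the Monte Carlo simulation the authors report, and the function name and the example genotype counts are hypothetical, chosen only for illustration.

```python
import math

def hardy_weinberg_test(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit test (1 df) for Hardy-Weinberg equilibrium.

    Takes the observed counts of the three genotypes at a biallelic locus
    and returns (chi-square statistic, asymptotic P value).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1.0 - p                       # frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical genotype counts, not taken from the study tables.
chi2, p_value = hardy_weinberg_test(120, 300, 230)
print(f"chi2 = {chi2:.3f}, P = {p_value:.3f}")
```

A locus would be flagged as a departure when the returned P value falls below .05; counting such flags across all loci gives the number that the authors compared with chance expectation.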

In the primary analysis, each genetic
variant was prespecified based on
published reports, and the frequencies
of risk-associated variants were
compared in cases and controls by using
a 100 000-iteration Monte Carlo
extension of the χ² test (SPSS 13.0
Exact Tests, SPSS Inc, Chicago, Ill).
The term statistically significant was
reserved for a P value below the
Bonferroni-corrected study-wide
significance threshold (0.05/84=0.0006).
Because the Bonferroni correction is
conservative when applied to a
replication study, the total number of
all positive associations at the P<.05
level was also compared with the
expected number by chance in 100 000
simulations. A

*References 13, 20, 42, 45, 72, 88, 98-100. 
Table 1. Validation of Predefined Risk Genotype Comparisons in Cases vs Controls 



Gene Symbol 


Variant 


Genotype 
Comparison 


Risk Variant 
Control Frequency 


Odds Ratio 
(95% CI) 


2-Tailed 
P Value 


Genotype Frequency 
Difference 


ABCA1 


a -7-7 /~[~7,$ 

-477C/ 1 • 


I I V5 \s 1 Ul v-/l«y 


0.222 


1.13(0.88-1.45) 


.35 


0.0215 


ABCA1 


Lys2l9Arg a 


• A VS G 


0.274 


1.02(0.87-1.20) 


.83 


0.0041 


ACE1 


indel 4 


PlO wo Pll r\r II 

uu vs ui or ii 


0.286 


1 07 (0 85-1 35) ' 


.60 


0.0139 


ADD1 


Gly460Trp 11,l4: 


I VS La 


n 1QQ 

U. I 5753 


1 09 (0 91-1 32) 


.35 


0.0148 


ADRB2§ 


• Glu27Gln 1J 


G vs C 


n AAA 


ft Q^ f0 RO-1 08) 


.34 


. -0.0186 


ADRB2 


He164Thr 13 


CC vs CT 


a qpq 
u.yoy 


ft 47 /ft 90-1 13} 


.10 


-0.0117 


ADRB2 


Gly16Arg 13 


A vs G 


O Q77 


1 n4 /ft rq-1 90^ 


.67 


0.0082 


ADRB3 


Arg64Trp 14 


C vsT 


A A77 


ft QQ /ft 7R-1 ^1} 

U.yy \U. I O I.Oij 


>.99 


-0.0004 


AGT 


Thr235Met 15 


T vs C 


a c;7o 


1 C\A Id 


.65 


0.0086 


AGTR1 


A1166C 16 


CC vs CA or AA 


a oqo 


1 nfl (n 7R-1 


.72 


0.0063 


ALOX5AP 


HAP B 17 


HAP B vs non-B 


0 neo 
U.uoz 


1 10 fO, QH-1 AC\\ 

1 . 1 £ 1 '^UJ 


.31 


0.0120 


ALOX5AP 


HAP A 17 


HAP A vs non-A 




0 on ft") 7f%_l 10} 


.32 


-0.0130 


APOA1 


C83T 18 


T vs C 


n 1 7Q 
U. 1 f 0 


ft QQ (0 81-1 20) 

U.C7v7 \U.O 1 1 .C\J) 


.92 


-0.0019 


APOA1 


-75G/A 19 


A vs G 


U.UU4 


1 QS fft RR-S S4) 


.23 


0.0037 


APOE 


Arg158Cys 20 


CC vs CT or TT 


O AOQ 


I .O I "O.Ufc/ 


.17 


0.0134 


APOE 


-219T/G 21 


T vs G 


O »7K 




.08 


0.0329 


BDKRB2 


-58C/T 22 


C vs T 


a «^qa 
u.oyu. 


O. Q^ fft ftft-1 ftfl} 


.37 


-0.0169 


CCL11 


Thr23Ala 23 


C vsT 


A QQ7 

U.ooY 


ft Q7 rt^l 1QV 

VJ.I7I \VJ.OU 1.157/ 


.80 


-0.0036 


CCR2 


Val64lle 24 


GG vs AG or AA 


O Q CC 

U.000 


n Q4 7n-i 0^ • 


.71 


-0.0082 


CCR5 


Indel 25 


I vs D 


0.907 




.05 


-0.0224 


CD14 


-1 59C/T 26 


TT vs CT or CC 


A OQ7 


1 1 n /ft ftft-1 ^Q} 

I . I U \U.OD I .053^ 


.46 


0.0170 


CETP 


intronl G/A 27 


G vs A 


0.568 


n qq /n Qi-1 Oft\ 

u.yo \u.c5i- 1 .voj 


^7 

.Of 


-0.0169 


CETP 


-629C/A 28 


C vs A 


0.505 


O OR /O QQ 1 i 

u.yo (u. 00-1 .1 ij 


ftft 
.uu . 


-0.0104 


COMTt 


VaUSSMet 29 


GG or AG vs AA 


0.222 


1 1 1 /A G7 1 AQ\ 


• 41 


0.0193 


CX3CR1 


He249VaP°" 31 


C vsT 


0.729 


A Oft /n Q1 11 Q\ 
u.yo \U.O l - 1 . 1 OJ 


.62 


-0 0088 


CX3CR1 


Thr280Met 30 . 


G vs A 


0.831 


1 Oft /A P.7-1 9Q^ 

i .uo ^u.o^ - 1 .^yj 


• UO 


0 0080 


CYP1 1B2\ 


-344T/C 32 ' 33 


C vsT 




1 .uy ^u.yH- 1 .£.0) 


27 


0.0204 


CYP2C9 


LeuSSgile 34 - 35 


AC vs AA 


0.094 


A 7ft /A C"0 1 101 


.17 


-0.0208 


CYP2C9*t 


Cysl^Arg 34 - 35 


CC vs CT 


uyyo 


1 Oi /A 77 1 Q9\ 
I .U I \U. f / - 1 .O^) 


.95 


0.0019 


ENPP1 


Gln^lLys 36 


C vsA 


0.130 


i 07 /A QC 1 QO\ 


RQ 

.oy 


0.0076 


ESR1 


-401T/C 37,58 


TT vs CT or CC 


0^285 


1 Aft rt"l QQ^ 

1 .uo iu.o*t- 1 .OO) 


.64 


0.0118 


F12 


46C/F 9 


TT vs CT or CC 


0.067 


A OO /A CA 1 >1A\ 

u.y<^ \u.0u-1 .^u; 


7R 
. 1 u 


-0.0051 


F13A1 


Val34Leu 40 


G vsT 


0.75o 


1 AO /A Oft 1 OO^ 


.79 


0.0045 


F2 


G20210A 4041 


A vs G 


0 rn 7 
U.U l / 


A QO /A K.O.1 

u.y^ \u.o*:- 1 .o*+; 


.88 


-0.0013 


F5 


Arg506Gln 40 


. A vs G 


0.025 


u.yo ^u.oo- 1 .au/ 


.81 


-0.0016 


F7t 


Arg353Gln 42 


G vs A 


u.oyo 


A QA /A 74-1 1Q\ 


.63 


-0.0062 


FGB 


-455A/G 43 


GG or AG vs AA 


a £A7 


1 07 M ft^-1 'Sft) 


.03 


0.0558 


GJA4 


C1019T 44 


T vs C 


0 qi 0 


n Qn /ft 77-1 ftfil 
u.yu \u./ / i .uu/ 


.22 


-0.021 5 


GP1BA 


-5T/C 44 46 


T vs C 




ft Qd /ft 7S-1 17) 

U.53*t \\J. 1 \J I . 1 f / 


.57 


-0.0071 


GRL 


AsnSo^Ser 47 


A(j> VS AA 


a 07Q 


ft 7Q /ft RO-1 1Q) 


.29 


-0.0146 


HFE 


Cys282Tyr 46 


A vs G 


A OCR. 
U.UOO 


ft Qft /ft 7V1 *^P) 


.94 


-0.0010 


HTR2A* 


Ser102Ser 4 ^ 


1 I vs CI or 


n mft 

U. I DO 


1 1? /0 R4-1 48) 


.47 


0.0154 


ICAM1 


Lys469Glu 50 


A vs G 


n ^R7 


1 ftQ (ft Q4-1 27) 
1 .uy \U.57H 1 .t / / 


.26 


0.0214 


IL1B 


-511C/T* 1 


CC vs CT or TT 


0.445 


1.14(0.92-1.41) 


.24 


0.0328 


IL6 


-174G/C 52 ' 53 


CvsG 


0.403 


1.05(0.91-1.22) 


.50 


0.0127 


IRS1 


Arg97lGly 64 


AvsG 


0.059 


0.96 (0.70-1.32) 


.81 


-0.0023 


ITGA2 


Phe807Phe 5556 


. AvsG 


0.391 


1.03(0.88-1.19) 


.73 


0.0063 


ITGB3 


Leu33Pro 4 


CvsT 


0.164 


0.85 (0.70-1.05) 


• 14. 


-0.0203 


LIPC 


. -514T/C 5758 


TvsC 


0.240 


0.85 (0.71-1.01) 


.07 


-0.0286 


LPA 


AspgAsn 50 


AG vs GG 


0.026 


1.43 (0.78-2.63) 


.29 


0.0107 


LRP1 


1Tir3261"mr ao 


GG or AG vs AA 


0.107 


1.02 (0.73-1.42) 


.93 


0.0018 
Table 1. Validation of Predefined Risk Genotype Comparisons in Cases vs Controls (cont) 



Gene Symbol 


Variant 


Genotype 
Comparison 


Risk Variant 
Control Frequency 


Odds Ratio 


2-Tailed 
P Value 


Genotype Frequency 
Difference 


LTA 


A252G 6162 


GG vs AG or AA 


0.1 16 


U.OO lU.O^- l -ex)) 


ATI 


-0.0147 


LTA 


Thr26Asn 61,62 


AA vs AC or CC 


0.119 


U.o^ (U.oy-l.l^; 


07 


-0.0191 


MGP 


Thr^Ma 63 


G vs A 


0.385 


l.UU ^U.oO-l. ID) 


QQ 


0 0000 


MGP 




Gvs A 


0.636 


I .UU \U.OO- 1 ;\\>) 


Q7 
.01 


-0.001 0 


MMP3 


indel 64 - 65 


DD vs Dl or II 


0.284 


f\ oil /n cc 1 nc\ 
Q.8*?(U.bO-l.UO) 


.lO 




MTHFR* 


Ala222Val 86 


TT vs CT or CC 


0.100 




m 

. iU 


0.0283 


MTP 


-493G/T 57,68 


TvsG 


0.255 


.1.01 (U.oO-1 .<Z\J) 


.00 


\J.\J\JC.O 


MTR 


Asp919Gly 69 


A vs G 


0.804 


-t r\0 in OK 1 Ov1\ 

1.0v3 (U.oO-1 .ZA) 


7Q 




NPPA 


Ter29ArgArg 70 


CC vs CT or TT 


0.023 


a on tr\ o qo\ 
1 .2\j (U.bl 






OLR1 


Lys167Asn 71 


C vsG 


0.916 


r\ nn /A CO 1 

\J.o2. (U.oo-1 .Uo) 


1 0 

.it 




p22-PHOX% 


His72Tyr 72 - 73 


CC vs CT or TT 


0.337 


i . io iu.yi -1 .oyj 


9ft 


n 0?QQ 


PAI1 


Indel 43 - 44 


DD vs Dl or II 


0.309 


1.00 (0.80-1 .do) 


>.yy 


n norm 


PECAM1 


Leu125Val 74 - 75 


GG vs CG or CC 


0.288 


0.94 (0.75-1 .19; 


CSX 


— u.u i<;u 


PECAM1 


Ser563Asn 74 - 75 


AvsG 


0.498 


0.97 (0.84-1 .13) 


-j-i 
.11 


__n nn79 


PON1 


Glnl92Arg 76 


AvsG 


0.705 


1.01 (0.86-1.18) 


.97 


U.UUIU 


PON2 


Cys311Ser 77 


CC vs CG or GG 


0.556 


1.10 (0.89-1.35) 


.40 


- ri no9°. 


PPARG 


Ala12Pro 78 


CvsG 


0.129 


0.83 (0.66-1.03) 


.10 


— u.u^uu 


PTGS2 


-765G/C 79 


CvsG 


0.168 


0.85 (0.69-1 .04) 


.11 


-o.Uiiiy 


RECQL2 


Arg1367Cys 80 


TvsC 


0.758 


0.80 (0.68-0,94) 


.01 


-U.U4oo 


SELE 


Leu554Phe 75 


TvsC 


0.036 


1.17 (0.80-1.71) 


.44 


0.0059 


SELE 


Ser128Arg 75 


C vs A 


0.103 


0.91 (0.71-1.16) 


.45 


-0.0086 


SELP 


Thr715Pro fl1 


AvsC 


0.902 


0.93 (0.73-1.18) 


.58 


-U.LaJDO 


TFPi 


Va^N/let 82 


AG vs GG 


0.050 


0.80 (0.49-1,32) 


.44 


-0.0096 


THBD 


-33G/A 83 


AG vs GG 


0.002 


2.39 (0.25-23.0) 


.63 




THBD 


Ala2o l nr^.. 


Ab VS 




1 ?Q (0 50-3 35) 


.64 


0.0030 


THBD 


Ala455Val 85 


CC vs CT or TT 


0.659 


1.05(0:85-1.31) 


.65 


0.0114 


THBS1 


Asn700Ser 86 


GG vs AG or AA 


0.021 


0.81 (0.38-1.72) 


.70 


-0.0039 


THBS2t 


3'UTR T/G 87 


TT or GT vs GG 


0.950 


0.51 (0.34-0.79) 


.002 


-0.0432 


THBS4 


Ala387Pro 87 


GG or CG vs CC 


0.939 


1.00(0.65-1.53) 


>.99 


-0.0002 


THPO 


A5713G 88 


GG vs AG or AA 


0.264 


1.20(0.95-1.51) 


.13 


0.0368 


TLR4 


Gly299Asp 89 


AvsG 


■ 0.942 


.1.02(0.75-1.40) 


.94 


0.0011 


TNF 


-308G/A 00 


AvsG 


0.158 


0.86(0.70-1.06) 


.17 


-0.0188 


TNFRSF1A 


Arg92Gln 91 


AG vs GG 


0.050 


0.80(0.49-1.32) 


.44 


-0.0096 



*Hardy-Weinberg equilibrium deviation in controls, P<.05 (n = 3). 
†Hardy-Weinberg equilibrium deviation in cases, P<.05 (n = 5). 
‡P<.001 (n = 2). 
§P<.001 (n = 1). 



surplus of positive associations over 
random expectations would imply that 
some are truly associated with ACS. 
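
The study-wide threshold quoted in the Statistical Analysis section (0.05/84) can be checked with a few lines of arithmetic. The sketch below also shows why a correction is needed at all; the family-wise error formula assumes the 84 comparisons are independent, which is an assumption of this example and not a claim made in the article. Function names are illustrative.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests

def family_wise_error(per_test_alpha, n_tests):
    """Chance of at least one false positive, assuming independent tests."""
    return 1.0 - (1.0 - per_test_alpha) ** n_tests

threshold = bonferroni_threshold(0.05, 84)
print(f"per-test threshold: {threshold:.4f}")
print(f"family-wise error with correction: {family_wise_error(threshold, 84):.3f}")
print(f"family-wise error at uncorrected P<.05: {family_wise_error(0.05, 84):.3f}")
```

Without correction, at least one nominally positive result among 84 independent comparisons is close to certain, which is why the count of nominal positives is compared with chance expectation rather than taken at face value.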

Secondarily, we also compared the
overall genotype distributions at each
locus in cases and controls by Monte
Carlo χ² testing. Power to confirm
individual genetic associations was
determined using a log-likelihood-based
method (Quanto 1.0). 103,104
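
A Monte Carlo χ² test of a 2 × 3 genotype table can be sketched as a label-permutation test. This is an illustration of the general technique, not the SPSS Exact Tests procedure the authors used; the function names, the default of 1000 iterations (the article used 100 000), and the fixed seed are choices made for this example. The counts are the p22-PHOX His72Tyr genotype counts from Table 2 (CC, CT, TT).

```python
import random

def chi2_statistic(cases, controls):
    """Pearson chi-square statistic for a 2 x k table of genotype counts."""
    n_cases, n_controls = sum(cases), sum(controls)
    total = n_cases + n_controls
    stat = 0.0
    for a, b in zip(cases, controls):
        column = a + b
        if column == 0:
            continue
        for observed, row_total in ((a, n_cases), (b, n_controls)):
            expected = column * row_total / total
            stat += (observed - expected) ** 2 / expected
    return stat

def monte_carlo_chi2(cases, controls, iterations=1000, seed=0):
    """Estimate the P value by shuffling case/control labels across genotypes."""
    rng = random.Random(seed)
    observed = chi2_statistic(cases, controls)
    columns = [a + b for a, b in zip(cases, controls)]
    # One entry per participant, holding that participant's genotype index.
    genotypes = [g for g, count in enumerate(columns) for _ in range(count)]
    n_cases = sum(cases)
    hits = 0
    for _ in range(iterations):
        rng.shuffle(genotypes)
        sim_cases = [0] * len(columns)
        for g in genotypes[:n_cases]:
            sim_cases[g] += 1
        sim_controls = [c - s for c, s in zip(columns, sim_cases)]
        if chi2_statistic(sim_cases, sim_controls) >= observed:
            hits += 1
    # Add-one estimator keeps the estimated P value strictly positive.
    return observed, (hits + 1) / (iterations + 1)

# p22-PHOX His72Tyr counts from Table 2: CC, CT, TT.
stat, p_value = monte_carlo_chi2([347, 271, 121], [288, 293, 74])
print(f"chi2 = {stat:.2f}, Monte Carlo P = {p_value:.4f}")
```

More iterations narrow the sampling error of the estimated P value, which matters most when the estimate sits near a decision threshold.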

Finally, as a measure to increase power,
the observed proportion of prespecified
risk variants found to be even
marginally more frequent in cases than
in controls was assessed by the Sign
test. Under the null hypothesis, each of
the risk variants is equally likely to
be more frequent in cases or in
controls. To estimate the Sign test's
power to detect an excess of even weakly
positive genetic associations (50 of 84
positive associations confers P=.05 in
the Sign test), we simulated the
resampling of 650 control and 811 case
genotypes across 84 genetic comparisons,
finding the minimum detectable odds
ratio that yields a 63.3% win rate for
each of the 84 risk variants, the
critical probability level that provides
80% confidence of having at least 50
wins.
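
The two numbers in the Sign test paragraph above (50 of 84 wins for P=.05, and a 63.3% per-variant win rate for 80% confidence of at least 50 wins) can be checked with an exact binomial tail. This is a sketch that treats the 84 comparisons as independent Bernoulli trials, which is an assumption of the example; the function name is illustrative.

```python
from math import comb

def binomial_upper_tail(k, n, p):
    """Exact probability of at least k successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# One-sided probability of 50 or more wins out of 84 under the null (p = 0.5).
null_tail = binomial_upper_tail(50, 84, 0.5)

# Probability of 50 or more wins when each variant wins with probability 0.633.
power = binomial_upper_tail(50, 84, 0.633)

print(f"P(at least 50 wins | p = 0.5)   = {null_tail:.3f}")
print(f"P(at least 50 wins | p = 0.633) = {power:.3f}")
```

The first value is the significance level attached to 50 wins, and the second is the chance of reaching 50 wins at the stated win rate.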



RESULTS 

The clinical characteristics of the 811 
cases and 650 controls are described in 
Table 3 and the distributions of their 
genotypes are shown in Table 2. The 
population of ACS cases included 308 
(38%) STEMI, 284 (35%) NSTEMI, and 
219 (27%) unstable angina patients. 
Cases and controls had similar age, sex, 
and body mass index distributions. A 
family history of coronary artery 
disease or MI among first-degree relatives 
was 2.7-fold higher in male cases than 
in male controls and 2.0-fold higher in 
female cases than in female controls. 
Table 2. Genotype Frequencies and P Values in Cases With Acute Coronary Syndrome and Controls 



Gene 


Variant 


Genotype 


No. (%) 
Cases 


No. (%) 
Controls 


2-Tailed 
P Value 








191 (24.6) 


182 (28.3) 








CC 


539(68.2) 456(70.0) 




ABCA1 


-477C/T 


on 


396(51.0) 


319(49.5) 


.27 


CCL11 


Thr23Ala1: 


CT 


239(30.3) 178(27.3) 


.20 


n 


189(24.4)' 


143(22.2) 








TT 


12(1.5) 17(2.6) 








AA 


65(8.2) 


46(7.1) 








AA 


7(0.9) 6(0.9) 




ABCA1 


Lys219Arg 


AG 


311 (39.3) 


263 (40.6) 


pn 

.Do 


CCR2 


Val64lle 


AG 


116(14.4) 89(13.6) 


.91 


GG 


416(52.5) 


338(52.2) 








GG 


681 (84.7) 561 (85.5) 








DD 


233 (30.0) 


185(28.6) 








II 


631 (78.4) 540(82.4) 




ACE1 


l/D 


Dl 


389(50.1) 


329 (50.9) 


OA 


CCR5 


Indel 


ID 


162(20.1) 108(16.5) 


.15 


II 


154(19.8) 


132 (20.4) 








DD 


12(1.5) 7(1.1) 








GG 


456(60.7) 


419(64.9) 








CC 


204(25.4) 193(29.5) 




ADD1 


Gly460Trp 


GT 


269(35.8) 


197 (30.5) 


.Uo 


CD14 


-159C/T 


CT 


395(49.2) . 306(46.8) 


.22 


TT 


26(3.5) 


30(4.6) 








TT 


204(25.4) 155(23.7) 








CC 


266(34.5) 


217(34.8) 








AA 


168(20.9) 135(20.6) 




ADRB2 


Gtu27Glnt 


CG 


358(46.5) 


264 (42.3) 


14. 
. I *+ 


CETP 


intronl G/A 


AG 


387(48.1) 297(45.3) 


.44 


GG 


146(19.0) 


143(22.9) 








GG 


250(31.1) 224(34.1) 








CC 


789(97.8) 


652 (98.9) 








AA 


205(25.6) 171(26.4) 




ADRB2 


He164Thr 


CT 


18(2.2) 


7(1.1) 


. I 1 


CETP 


-629C/A 


AC 


400(49.9) 298(46.1) 


.31 


TT 


0 


0 








CC 


197(24.6) 178(27.5) 








AA 


128(16.3) 


100(15.2) 








AA 


231 (29.8) 181 (28.1) 




ADRB2 


Gty16Arg 


AG 


348 (44.3) 


294(44.8) 


.Of 




VOi 1 OOIVICI+ 


AG 


358 (46.1) 321 (49.8) 


.39 


GG 


309(39.4) 


262 (39.9) 








GG 


187(24.1) 143(22.2) 








CC 


6(0.7) 


1 (0.2) 








CC 


410(51.1) 353(53.6) 




ADRB3 


Arg64Trp 


CT 


111(13.8) 


99(15.1) 


.21 






CT 


336(41.9) 254(38.6) 


.43 


TT 


687 (85.4) 


557 (84.8) 








TT 


56(7.0) 51(7.8) 








CC 


143(17.8) 


107 (16.5) 








AA 


18(2.2) 13(2.0) 




AGT 


TTir235Met 


CT 


387 (48.3) 


340 (52.6). 






ThrPRDMpt 


AG 


223(27.7) 195(29.8) 


.65 


TT 


272 (33.9) 


. 200 (30.9) 








GG 


565(70.1) 447(68.2) 








AA 


388(48.1) 


332 (50.8) 








CC 


163(20.6) 109(16.6) 




AGTR1 


A1166C 


AC 


339(42.1) 


262 (40.1) 


.61 


PVD1 1 £29 




CT 


352(44.6) 319(48.6) 


.12 


CC 


79 (9.8) 


60 (9.2) 








TT 


275(34.8) 229(34.9) 








B 


50(6.5) 


41 (5.8) 








AA 


708 (92.7) 568 (90.6) 




ALOX5AP 


HAP B 


non-B 


734 (93.5) 


661 (94.2) 


.Ol 




l_t3lJO>Jv7llG 


AC 


56(7.3) 59(9.4) 


.17 




NA 


NA 








CC 


0 0 








A 


124 (15.9) 


122 (17.4) 








CC 


589(80.0) 491(79.8) 




ALOX5AP 


HAP A 


non-A 


654(84.1) 


584 (82.6) 


.00 




Pv<;144Ara*± 


CT 


147(20.0) 124(20.2) 


.95 




NA 


NA 








TT 


0 0 








AA 


23(3.0) 


25 (3.8) 








AA 


600(74.3) 498(75.7) 




APOA1 


C83T 


AG 


219(28.3) 


175 (26.9) 






filnl 21 Lvs 


AC 


192(23.8) 149(22.6) 


.84 


GG 


532 (68.7) 


450 (69.2) 








CC 


15(1.9) 11(1.7) 








AA 


1 (0.1) 


0 








CC 


145(18.0) 143(21.8) 




APOA1 


-75G/A 


AG 


10(1.3) 


5 (0.8) 


.0 1 


ESR1 


-401 T/C 


CT 


421 (52.3) 326(49.7) 


.20 


GG 


784 (98.6) 


coo /no d\ 

boo (yy.^j 








TT 


239(29.7). 187(28.5) 








CC 


29(3.6) 


15(2.3) 








CC 


459 (58.0) 371 (56.6) 




APOE 


Arg158Cys 


CT 


209 (26.1) 


154(23.5) 


.14 


F12 


46C/T 


CT 


283 (35.8) 241 (36.7) 


.84 


TT 


562 (70.3) 


487(74.2) 








TT 


49(6.2) 44(6.7) 








GG 


194 (24.2) 


177 (27.3) 








GG 


443 (56.8) 354 (55.8) 




APOE 


-219T/G 


GT 


403 (50.2) 


327 (50.5) 


.21 


F13A1 


Val34Leu 


GT 


296(37.9) 247(39.0) 


.93 


TT . 


206 (25.7) 


144(22.2) 








TT 


41(5.3) 33(5.2) 








CC 


263(32.8) 


221 (33.6) 








AA 


1 (0.1) 1 (0.2) 




BDKRB2 


-58C/T 


CT 


394(49.1) 


335 (50.9) 


.43 


F2 


G20210A 


AG 


23(2.9) 20(3.0) 


.94 


TT 


145 (18.1) 


102 (15.5) 








GG 


783(97.0) 635(96.8) 
Table 2. Genotype Frequencies and P Values in Cases With Acute Coronary Syndrome and Controls (cont) 









No. (%) 


2-Tailed 
P Value 


Gene 


Variant 


Genotype 


Cases 


I 

Controls 






AA 


1 (0.1) 


1 (0-2) 




F5 


Arg506Gln 


AG 


36 (4.5) 


31 (4.7) 


.95 






GG . 


769 (95.4) 


623 (95.1) 








AA 


16(2.0} 


6 (0.9) 




F7 


Arg353Gln* 


AG 


148(18.7) 


129 (19.6) 


.22 






GG 


629 (79.3) 


522 (79.5) 








AA 


24 (3.0) 


26 (4.0) 




FGB 


-455A/G 


AG 


247 (30.7) 


229 (35.3) 


.08 . 






GG 


533(66.3) 


394 (60.7) 








CC 


401 (50.6) 


311 (47.5) 




GJA4 


C1019T 


CT 


313(39.5) 


272(41.5) 


. .47 






rr 


78(9.8) 


72 (11.0) 








cc 


13(1.7) 


8(1.2) 




GPlBA 


-5T/C 


CT 


168(21.6) 


138(21.1) 


. .75 






TT 


597.(76.7) 


509 (77.7) 








AA 


756(94.1) 


608 (92.7) 




GRL 


Asn363Ser 


AG 


47 (5.9) 


48 (7.3) 


.29 






GG 


0 


0 








AA 


3 (0.4) 


1 (0.2) 




HFE 


Cys282Tyr 


AG 


96(12.0) 


83(12.6) 


.70 






GG 


703 (87.7) 


574 (87.2) 








CC 


286(36.5) 


275 (42.0) 




HTR2A 


Ser102Ser* 


CT 


363(46.4) 


278 (42.4) 


.11 






TT 


134(17.1) 


102(15.6) 








AA 


270(34.0) 


195(30.2) 




ICAM1 


Lys469Glu 


AG. 


379(47.7) 


329(51.0) 


.30 






GG 


145.(18.3) 


121 (18.8) 








CC 


359 (47.7) 


289 (44.5) 




IL1B 


-511 C/T 


CT 


311 (41.4) 


292 (44.9) 


.40 






TT 


82(10.9) 


69 (10.6) 








CC 


142 (17.6) 


106(16.1) 




IL6 


-174G/C 


CG 


386(48.0) 


319(48.5) 


.73 






GG 


277 (34.4) 


" 233 (35.4) 








AA 


3(0.4) 


3 (0.5) 




IRSl 


Arg971Gly 


AG 


84(10.6) 


69 (10.9) 


.98 






GG 


704 (89.0) 


562 (88.6) 








AA 


123 (15.3) 


108(16.4) 




ITGA2 


Phe807Phe 


AG 


394 (48.9) 


298 (45.4) 


.40 






GG 


288 (35.8) 


251 (38.2) 








CC 


20 (2.5) 


14(2.2) 




ITGB3 


Leu33Pro 


CT 


188(23.6) 


182(28.3) 


.12 






TT 


588(73.9) 


446 (69.5) 








CC 


506 (62.9) 


369 (56.8) 




LIPC 


-514T/C 


CT 


256(31.8) 


250 (38.5) 


.03 






TT 


42 (5.2) 


31 (4.8) 








AG 


29(3.7) 


17(2.6) 




LPA 


Asp9Asn 


GG 


765 (96.3) 


641 (97.4) 


.29 






GG 


0 


0 








AA 


367 (46.9) 


283 (43.2) 




LRPl 


Thr326lThr 


AG 


330 (42.2) 


302(46.1) 


.31 



GG 85(10.9) 70(10.7) 



No. (%) 



Gene 


Variant 


Genotype 


Cases 


Controls 


2-Tailed 
P Value 






AA 


394(49.1) 


282(429) 




LTA 


A252G 


AG 


327 (40.8) 


299 (45.5) 


.06 






GG 


.81 (10.1) 


76 (1 1 .6) 






• 


AA 


80(10.0) 


78 (11.6) 




LTA 


Thr26Asn 


AC 


331 (41.4) 


297 (45.3) 


.07 






CC 


389 (48.6) 


280 (42.7) 








AA 


308 (38.3) 


257 (39.1) 




MGP 


Thr83Ala 


AG 


374 (46.5) 


294 (44.7) 


.79 






GG 


123 (15.3) 


106(16.1) 








AA 


110(13.6) 


95(14.5) 




MGP 


-7A/G 


AG 


368(45.7) 


288(43.8) 


.77 






GG 


328 (40.7) 


274(41.7) 








DD 


194 (24.7) 


176(28.4) 




MMP3 


indel 




386(49.1) 


294 (47.5) 


.27 






. II 


206 (26.2) 


149(24.1) 








CC 


350(44.1) 


272(41.3) 




MTHFR 


Ala222Val* 


CT 


341 (43.0) 


320(48.6) 


.06 






TT 


102(12.9) 


66 (10.0) 








GG 


449 (55.8) 


371 (56.5) 




MTP 


-493G/T 


GT 


297 (36.9) 


237 (36.1) 


.95 






TT 


59 (7.3) 


49(7.5) 








AA 


529 (66.0) 


423 (65.7) 




MTR 


Asp919Gty 


AG 


239 (29.8) 


190(29.5) 


.88 






GG 


34 (4.2) 


31 (4.8) 








CC 


22 (2.8) 


15(2.3) 




NPPA 


Ter29ArgArg 


CT 


190(23.9) 


159(24.7) 


.83 






TT 


583 (73.3) 


471 (73.0) 








CC 


649 (80.8) 


543 (83.7) 




OLR1 


Lys167Asn 


CG 


146(18.2) 


103(15.9) 


.26 






GG 


8(1.0) 


3(0.5) 








CC 


347 (47.0) 


288 (44.0) 




P22-PHOX 


His72Tyr§ 


CT 


271 (36.7) 


293(44.7) 


.002 






TT 


121 (16.4) 


74 (11.3) 








DD 


249 (30.9) 


203(30.9) 




PAI1 


indel 


Dl 


398 (49.4) 


314(47.9) 


.77 






II 


159(19.7) 


. 139(21.2) 








CC 


187 (23.3) 


155(23.6) 




PEC AM 1 


Leu125Val 


CG 


395 (49.1) 


312(47.6) 


.82 






GG 


222 (27.6) 


189(28.8) 








AA 


200(25.0) 


163(25.5) 




PEC AM 1 


Ser563Asn 


AG 


386 (48.3) 


312(48.8) 


.92 






GG 


214(26.8) 


165(25.8) 








AA 


396(49.6) 


324 (49.3) 




PON1 


Gln192Arg 


AG 


337 (42.2) 


279(42.5) 


>.99 






GG 


66(8.3) 


54 (8.2) 








CC 


464 (57.9) 


366 (55.6) 




PON2 


Cys311Ser 


CG 


298 (37.2) 


251 (38.1) 


.50 






GG 


40 (5.0) 


41 (6.2) 








CC 


637 (79.2) 


492 (75.9) 




PPARG 


Ala12Pro 


CG 


159 (19.8) 


145 (22.4) 


.22 






GG 


8(1.0) 


11(1.7) 
Table 2. Genotype Frequencies and P Values in Cases With Acute Coronary Syndrome and Controls (cont) 









No. (%) 


2-Tailed 
P Value 


Gene 


Variant 




Cases 


I 

Controls 






cc 


, 15{1.9) 


21 (3.3) 




DTPCO 






202 (25.5) 


174 (27.1) 


.17 






Uo 


576(72.6) 


447 (69.6) 










66 (8.2) 


38 (5.8) 




RECQL2 


Arg1367Cys 


U 1 


326 (40.5) 


239 (36.7) 


n^ 






TT 
1 1 


412(51.2) 


375 (57.5) 








uo 


740(91.9) 


611 (92.9) 




SELE 


Leuo54Hne 


pT 

U 1 


63 (7.8) 


47(7.1) 


A7 
Ml 






XT 


; 2(0.2) 


0 








A A 

AA 


658 (82.0) 


528 (80.4) 




SELE 


Ser128Arg 


AO 


137 (17.1) 


123 (18.7) 








OU 


7(0.9) 


6 (0.9) 








AA 
AM 


646 (80.2) 


530 (80.8)' 




SELP 


Trir715Pro 


AO 


150(18.6) 


124(18.9) 








CC 


9(1.1) 


2 (0.3) 








AA 
AA 


758 (95.9) 


625 (95.0) 




TFPl 


Val2D4Met . 


A(j 


32 (4.1) 


33 (5.0) 


MO 






GG 


0 


0 








AA 


801 (99.6) 


638 (99.8) 




THBD 


-33G/A 


AG 


3(0.4) 


1 (0.2) 


.63 






GG 


0 


0 . 








AA 


794 (98.6) 


652 (98.9) 




THBD 


Ala25Trir 


AG 


11.(1.4) 


7(1.1). 


.64 






GG 


0 


0 








CC 


531 (67.0) 


433 (65.9) 




THBD 


Ala455Val 


CT . 


237 (29.9) 


203(3.9) 


.91 






TT 


24(3.0) 


21 (3.2) 











No. (%) 


2-Tailed 
P Value 




Variant 




- Cases 


Controls 






AA 


614 (76.3) 


507 (77.2) 




/ nDO i - 


rvsi 1 1 UUOW 


AG 


177 (22.0) 


136(20.7) 


.74 






uo 


14(1.7) 


14(2.1) 








00 


74 (9.4) 


33 (5.0) 




T7-VPC9 


o u i n 1/04. 


O I 


250 (31 .6) 


251 (38.4) 


.001 






TT 
I 1 


466 (59.0) 


370 (56.6) 








r.r 


-49(6.1) 


40(6.1) 




77-/QCM 


AJaoO/ rTO 




268 (33.4) 


229 (34.8) 


.00 








486(60.5) 


389 (59.1) 








AA 
AA 


187 (23.3) 


159(24.2) 




irfrXJ 


AO' loo 


Ap 
AVJ . 


374(46.6) 


324 (49.4) 








00 


241 (30.0) 


173 (26.4) 








AA 


702 (88.7) 


579(88.4) 




TLR4 


Gly299Asp 


AG 


88(11.1) 


76(11.6) 


.89 






GG 


1(0.1) 


0 








AA 


17(2.1) 


14(2.2) 




TNF 


-308G/A 


AG 


189 (23.5) 


176(27.2) 


.27. 






GG 


597 (74.3) 


457 (70.6) 








AA 


784 (97.0) 


627 (95.4) 




TNFRSF1A 


Arg92G!n 


AG 


24 (3.0) 


30(4.6) 


.13 






GG 


0 


0 





*Hardy-Weinberg equilibrium deviation in controls, P<.05 (n = 3). 
†Hardy-Weinberg equilibrium deviation in cases, P<.05 (n = 5). 
‡P<.001 (n = 1). 
§P<.001 (n = 2). 



(continued) 



Male and female cases were signifi- 
cantly more likely to be current smok- 
ers and to have type 2 diabetes mellitus 
but less likely to consume at least 1 al- 
coholic drink per month. Frequencies of 
hypercholesterolemia and hyperten- 
sion were higher in female cases than in 
controls; no significant differences were 
observed in males. Previous revascular- 
ization had been performed in 35.6% of 
incident ACS cases and in none of the 
controls. 

A total of 85 variants in 70 genes were 
genotyped in cases and controls. The 
overall genotype call rate for these vari- 
ants was 98.5% (range, 95.0%-99.8%). 
Two percent of all samples were geno- 
typed in duplicate for each marker in a 
blinded fashion as a measure of geno- 
type reproducibility. Among the 2511 
repeated genotypes, 5 were discordant, 
demonstrating a reproducibility of 99.8%. 



Tests of Hardy-Weinberg equilibrium 
revealed that 1 variant violated it 
in both cases and controls, at the P<.05 
level; 7 violated it in cases only; and 4 
violated it in controls only (Table 1 and 
Table 2). This finding is not more than 
expected by chance (4 violations ex- 
pected by chance in each group; see the 
Methods section) and therefore none 
was excluded from further analysis at 
this stage. 

With respect to power parameters, the
mean effective frequency (or
1 - frequency, if q >0.5) in controls of
the putative risk variants studied was
0.20, and 58 (68.2%) were common (>0.1),
25 (29.4%) were uncommon (<0.1; >0.01),
and 2 (2.4%) were rare (<0.01). Our
sample had 80% power to confirm, by the
Monte Carlo χ² test, a genotype-specific
relative risk of 2.3 for a rare variant
(q=0.01), 1.4 for a relatively uncommon
variant (q=0.1), and 1.25 for a common
allele (q=0.5).

We tested whether each putative risk
variant showed a significant difference
in frequency between cases and controls
(Table 1). An odds ratio greater than 1
indicates that the risk genotype was in
higher frequency among cases, and if so,
the genotype frequency difference was
reported as a positive decimal number.
Only 1 genetic variant was significant
at the P<.05 level, which is the number
most likely by chance alone. The -455
variant, which lies upstream of the
transcription initiation site in the
β-fibrinogen gene, replicated the
originally reported association, with
the GG genotype being more frequent in
cases than controls (frequency, 66% in
cases vs 61% in controls; odds ratio,
1.27; P=.03). In
addition, we found the MEF2A 21-bp
deletion in 1 case and 1 control,
confirming that this is a rare variant
in the population. 105

Table 3. Characteristics of 1461 White Participants Genotyped for 85 Genetic Variants*

                                Men (n = 944)                 Women (n = 517)
Characteristics                 ACS Cases      Controls       ACS Cases      Controls
                                (n = 550)      (n = 394)      (n = 261)      (n = 256)

Age, mean (SD), y               6.7 (12.5)     6.0 (12.1)     63.1 (13.2)    61.8 (12.8)
Body mass index, mean (SD)†     29.1 (5.5)     27.9 (5.0)     29.9 (6.9)     27.7 (6.9)
Family history of CAD/MI        279 (50.7)‡    109 (27.7)     135 (51.7)‡    90 (35.5)
Prior myocardial infarction     142 (25.8)‡    0              74 (28.4)‡     0
Prior revascularization         205 (37.3)‡    0              83 (31.8)‡     0
Congestive heart failure        23 (4.2)‡      0              18 (6.9)‡      0
Hypertension                    305 (55.5)     207 (52.5)     182 (69.7)‡    126 (49.2)
Type 2 diabetes mellitus        116 (21.1)‡    42 (10.7)      77 (29.5)‡     35 (13.7)
Hypercholesterolemia            314 (57.1)     208 (52.8)     162 (62.1)‡    117 (45.7)
Postmenopausal                                                189 (68.6)‡    219 (85.5)
College graduate                166 (30.2)‡    238 (60.4)     40 (15.3)‡     72 (28.1)
Smoking <30 d ago               183 (33.3)‡    55 (14.0)      85 (32.6)‡     31 (12.1)
Alcohol frequency >1/mo         221 (40.2)‡    210 (53.3)     38 (14.6)‡     84 (32.8)

Abbreviations: ACS, acute coronary syndrome; CAD, coronary artery disease; MI, myocardial infarction.
*Data are presented as number (percentage) unless otherwise indicated.
†Body mass index is calculated as weight in kilograms divided by height in meters squared.
‡P<.001 for the comparison with controls of the same sex.
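
The odds ratio and genotype frequency difference reported for the β-fibrinogen -455 variant can be reproduced from the Table 2 counts (cases: GG 533, AG 247, AA 24; controls: GG 394, AG 229, AA 26). The sketch below uses the standard log-odds (Woolf) confidence interval; the article does not state which interval method was used, so that choice is an assumption, and the function names are illustrative.

```python
import math

def odds_ratio_ci(case_risk, case_other, control_risk, control_other, z=1.96):
    """Odds ratio with a Woolf (log-odds) confidence interval for a 2 x 2 table."""
    odds_ratio = (case_risk * control_other) / (case_other * control_risk)
    se_log = math.sqrt(
        1 / case_risk + 1 / case_other + 1 / control_risk + 1 / control_other
    )
    log_or = math.log(odds_ratio)
    return odds_ratio, math.exp(log_or - z * se_log), math.exp(log_or + z * se_log)

def frequency_difference(case_risk, case_other, control_risk, control_other):
    """Risk genotype frequency in cases minus its frequency in controls."""
    return (case_risk / (case_risk + case_other)
            - control_risk / (control_risk + control_other))

# FGB -455A/G: risk genotype GG vs AG or AA, counts from Table 2.
cases_gg, cases_other = 533, 247 + 24
controls_gg, controls_other = 394, 229 + 26

or_, low, high = odds_ratio_ci(cases_gg, cases_other, controls_gg, controls_other)
diff = frequency_difference(cases_gg, cases_other, controls_gg, controls_other)
print(f"OR = {or_:.2f} (95% CI, {low:.2f}-{high:.2f}); difference = {diff:.4f}")
```

A positive difference corresponds to an odds ratio above 1, matching the sign convention described for Table 1.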

Several supplementary analyses were
performed. When the genotypes of cases
and controls were analyzed by extension
of 2 × 3 χ² tests to 100 000
simulations, 4 loci, RECQL2, THBS2,
LIPC, and p22-PHOX, were marginally
significant (Table 2). In each case, the
specific genetic risk model providing
significance was different from that
reported in the literature; hence, these
cannot be considered formal replications
and the total number of positive
associations is not in excess of random
expectations.

Finally, we found that only 41 of 84 
predefined risk variants were even mar- 
ginally more frequent in cases than in 
controls (excluding 1 tie, the rare 
MEF2A deletion), representing a 48.8% 
win rate (95% confidence interval, 
38.1%-59.5%) for the collective-risk 
genotypes. This observed proportion of 
wins is not different from the ex- 
pected proportion (50%) under the null 
hypothesis (P= .91). Table 1 shows that 
the absolute differences in risk geno- 
type frequencies between cases and con- 
trols (negative signs meaning that the 
putative risk genotype was more fre- 
quent in controls than in cases) were 
small, with a median difference of 



-0.0003, and a maximum of 0.056
(β-fibrinogen).
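
The Sign test result reported above (41 wins of 84, a 48.8% win rate, 95% confidence interval 38.1%-59.5%, P=.91) can be reproduced as follows. The sketch uses an exact two-sided binomial test and a normal-approximation (Wald) interval; the article does not name the interval method, so the Wald choice is an assumption that happens to match the reported bounds. Function names are illustrative.

```python
from math import comb, sqrt

def sign_test_two_sided(wins, n):
    """Exact two-sided Sign test P value against a 50% win rate."""
    smaller = min(wins, n - wins)
    one_tail = sum(comb(n, i) for i in range(smaller + 1)) / 2**n
    return min(1.0, 2 * one_tail)

def wald_interval(wins, n, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    rate = wins / n
    half_width = z * sqrt(rate * (1 - rate) / n)
    return rate - half_width, rate + half_width

wins, comparisons = 41, 84
low, high = wald_interval(wins, comparisons)
print(f"win rate = {wins / comparisons:.1%}")
print(f"95% CI = {low:.1%} to {high:.1%}")
print(f"two-sided P = {sign_test_two_sided(wins, comparisons):.2f}")
```

Because the interval comfortably contains 50%, the observed win rate gives no indication that the predefined risk variants are collectively enriched in cases.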

COMMENT 

We were unable to confirm as risk fac- 
tors for ACS 85 genetic variants be- 
cause none was unequivocally vali- 
dated in this large case-control study 
of 1461 participants. In the primary 
analysis, only the -455 promoter variant
(in β-fibrinogen) was nominally
statistically significant (P=.03). Among the
4 variants in the secondary analysis that 
met nominal statistical thresholds, there 
was an excess of a different variant than 
was previously reported among cases 
in the original study, which does not 
support replication. We therefore con- 
clude that our findings, in this large 
sample of well-characterized ACS pa- 
tients and controls, cannot support that 
this panel of gene variants contains 
bona fide ACS risk factors. 

Our findings come at a critical junc- 
ture in complex disease genetics. Some 
cardiovascular gene variants (eg, ACE, 
AGT, AGTR1, ITGB3, F2, F5, MTHFR) 
included in our study can already be or- 
dered clinically, for indications that ex- 
plicitly include possible ACS risk. How- 
ever, our findings suggest that such 



underscore the importance of robust rep- 
lication studies of reported associations 
prior to their application to clinical care. 

These nonreplications include vari- 
ants in several high-profile studies. For 
example, haplotypes A and B of
5-lipoxygenase activating protein
(ALOX5AP) were reported in 1 study
to be associated with MI in the general
populations of Iceland and the United
Kingdom, respectively. 17 We found that
neither haplotype was associated with ACS,
in spite of our observed haplotype fre- 
quencies in cases and controls closely 
approximating those found in the total 
United Kingdom data set (cases and 
controls) previously (haplotype A, 
0.165 vs 0.160, respectively; haplo- 
type B, 0.062 vs 0.058). 

Although our study raises signifi- 
cant doubts about the collective panel 
of putative genetic risk factors, it does 
not invalidate any particular previous 
study. Possible explanations of our 
negative results could include: (1) false- 
negative results in our study; (2) false- 
positive associations in previous stud- 
ies; and (3) varied effects of risk variants 
in different genetic backgrounds. 

False-negative results as a general ex- 
planation for our study's null findings 
are unlikely given that our sample size 
is substantially larger than all but a few 
reported prior studies and was pow- 
ered to detect modest relative risks. 
Based on a random sample (n=30) of 
articles included in this study (1 per 
gene variant), we estimated that the 
mean odds ratio reported in positive 
studies was 2.3 (range, 1.25-5.0), in- 
dicating that we had well in excess of 
80% power to replicate most reports. 
However, isolated positive reports may 
overestimate genetic risks. 3,6
Recently, a meta-analysis of 14 genes
included in our study reported odds
ratios ranging from 1.10 to 1.73 for risk
of MI. 3 It is possible that minute odds
ratios are to be expected in complex dis- 
ease genetics and that neither our study 
nor most previous studies were suffi- 
ciently powered. Accordingly, we aug- 
mented our power, by use of the Sign 
test, to detect a surplus of as few as 16 
weakly positive genetic risk factors 
among the entire set that we geno- 
typed (84 -16 = 50, the number re- 
quired for a significant Sign test), cor- 
responding to a mean odds ratio of 1.05 
or higher given our sample size and the 
average risk genotype frequency. 
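
The Sign test described above can be sketched as an exact binomial tail: under the null hypothesis, each genotyped variant is equally likely to show an odds ratio above or below 1. This is a minimal sketch, assuming a one-sided exact test; the source does not state whether the authors used an exact or an approximate form, and the function names are hypothetical.

```python
from math import comb
from typing import Optional


def sign_test_p_value(n_positive: int, n_total: int) -> float:
    """One-sided exact sign test: P(X >= n_positive), X ~ Binomial(n_total, 0.5)."""
    tail = sum(comb(n_total, k) for k in range(n_positive, n_total + 1))
    return tail / 2 ** n_total


def min_positive_for_significance(n_total: int, alpha: float = 0.05) -> Optional[int]:
    """Smallest count of positive-direction variants whose tail probability is <= alpha.

    Returns None when even n_total positives would not reach significance.
    """
    for k in range(n_total + 1):
        if sign_test_p_value(k, n_total) <= alpha:
            return k
    return None
```

The threshold returned for a given number of variants depends on the exact-versus-approximate choice and on the significance level, so it need not reproduce the figure quoted in the text.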

Absence of genetic effect only in our 
cohort is also unlikely. Cases showed 
a 2-fold higher family history of ACS, 
consistent with a genetic effect con- 
tributing to phenotypes in this cohort. 
In addition, homozygosity coding for 
an arginine residue at position 158 of 
apolipoprotein E (E4 variant), consid- 
ered 1 of the least controversial of the 
putative ACS susceptibility factors de- 
spite some inconsistency in certain co- 
horts, 106 was significantly associated 
(P= .04) among cases with hyperlipid- 
emia (4.1%) vs controls without hy- 
perlipidemia (1.6%). 

False-positive results in previous 
studies are another potential explana- 
tion for the discrepancy between our 
findings and those of others. This is- 
sue has previously been recognized as 
a serious problem with association stud- 
ies, particularly when sample sizes are 
underpowered. 107 It is difficult to iden- 
tify true vs false positives by analysis of 
the literature alone. 108 Unrecognized 
stratification between cases and con- 
trols can create spurious associa- 
tions, 109 and the absence of negative ge- 
nomic controls in nearly all prior 
studies to exclude this possibility leaves 
this an open question. Also difficult to 
assess is the extent to which publica- 
tion bias and multiple hypothesis test- 
ing have had an effect. 

It could be argued that our research 
participants are distinct from those re- 
ported previously and that our results 
may not bear on the validity of posi- 
tive associations reported in different 
populations and clinical subgroups (eg, 
analyses substratified by age, sex, or a 
clinical variable, such as hyperten- 
sion, hyperlipidemia, or smoking sta- 
tus). Given that the vast majority of 
common variants in the human ge- 
nome date to our shared ancestry in 
Africa, 110 it is not likely that there are 
different common functional variants 
in linkage disequilibrium with risk vari- 



ants in our population vs others. Less 
common mutations of more recent an- 
cestral origin could conceivably be cor- 
related with certain genetic variants in 
one population but not another. The ex- 
tent to which linkage disequilibrium 
patterns might explain our findings is 
unknown, but our study population is 
quite typical of the mixed European 
background that is prevalent in the 
United States. 

Another possibility is that the effect 
of risk variants is different in different 
genetic backgrounds; if true, the lack 
of generalizability of results will se- 
verely limit their application to the clini- 
cal arena. The fact that we failed to rep- 
licate positive associations in a 
consecutive series of study partici- 
pants that are broadly representative of 
the disease encountered in clinical prac- 
tice places limitations on the potential 
applicability of prior findings and sup- 
ports our premise that it is premature 
to extrapolate these earlier findings to 
routine clinical care. 

The failure of the candidate gene ap- 
proach to identify variants conferring 
susceptibility to ACS risk prompts con- 
sideration of other approaches. One 
promising approach is to screen the en- 
tire genome in an unbiased way in a 
large sample for variants that are sig- 
nificantly associated with disease risk. 
Coupled with the understanding of un- 
derlying patterns of linkage disequilib- 
rium in the human genome 7 and the 
ability to inexpensively obtain geno- 
types across the genome, the field is 
moving rapidly toward a comprehen- 
sive genome-wide approach. Chal- 
lenges of this approach include the un- 
known number of variants that impart 
effect, the magnitude of the effect im- 
parted by each, and the extent to which 
common variants as opposed to rare in- 
dependent mutations account for dis- 
ease risk. 

Regardless of the approach taken, it 
is clear that multiple large, well- 
matched cohorts of cases and controls 
will be required to achieve valid 
progress in the genetic analysis of ACS 
and other complex human diseases. 
Our null findings indicate the need for 



caution in the interpretation of ge- 
netic associations in different clinical 
populations and the need for exten- 
sive validation of genetic risk factors. 
Special Report 



Prediction of Coronary Heart Disease Using Risk 

Factor Categories 

Peter W.F. Wilson, MD; Ralph B. D'Agostino, PhD; Daniel Levy, MD; Albert M. Belanger, BS; 

Halit Silbershatz, PhD; William B. Kannel, MD 

Background — The objective of this study was to examine the association of Joint National Committee (JNC-V) blood 
pressure and National Cholesterol Education Program (NCEP) cholesterol categories with coronary heart disease (CHD) 
risk, to incorporate them into coronary prediction algorithms, and to compare the discrimination properties of this 
approach with other noncategorical prediction functions. 

Methods and Results — This work was designed as a prospective, single-center study in the setting of a community-based 
cohort. The patients were 2489 men and 2856 women 30 to 74 years old at baseline with 12 years of follow-up. During 
the 12 years of follow-up, a total of 383 men and 227 women developed CHD, which was significantly associated with 
categories of blood pressure, total cholesterol, LDL cholesterol, and HDL cholesterol (all P<.001). Sex-specific 
prediction equations were formulated to predict CHD risk according to age, diabetes, smoking, JNC-V blood pressure 
categories, and NCEP total cholesterol and LDL cholesterol categories. The accuracy of this categorical approach was 
found to be comparable to CHD prediction when the continuous variables themselves were used. After adjustment for 
other factors, ≈28% of CHD events in men and 29% in women were attributable to blood pressure levels that exceeded 
high normal (≥130/85). The corresponding multivariable-adjusted attributable risk percent associated with elevated 
total cholesterol (≥200 mg/dL) was 27% in men and 34% in women. 

Conclusions — Recommended guidelines of blood pressure, total cholesterol, and LDL cholesterol effectively predict CHD 
risk in a middle-aged white population sample. A simple coronary disease prediction algorithm was developed using 
categorical variables, which allows physicians to predict multivariate CHD risk in patients without overt CHD. 
(Circulation. 1998;97:1837-1847.) 

Key Words: coronary disease ■ prediction ■ hypertension ■ cholesterol 



Coronary heart disease continues to be a leading cause of 
morbidity and mortality among adults in Europe and 
North America. 1 Risk factors have included blood pressure, 
cigarette smoking, cholesterol (TC), LDL-C, HDL-C, and 
diabetes. 2-4 Factors such as obesity, left ventricular hypertro- 
phy, family history of premature CHD, and ERT have also 
been considered in defining CHD risk. 5-7 Data from popula- 
tion studies enabled prediction of CHD during a follow-up 
interval of several years, based on blood pressure, smoking 
history, TC and HDL-C levels, diabetes, and left ventricular 
hypertrophy on the ECG. These prediction algorithms have 
been adapted to simplified score sheets that allow physicians 
to estimate multivariable CHD risk in middle-aged patients. 8 

See p 1761 

The present article develops a simplified coronary predic- 
tion model, building on the blood pressure, cholesterol, and 
LDL-C categories proposed by the JNC-V and NCEP ATP 
II. 7,9,10 The analysis evaluates the utility and accuracy of blood 
pressure, cholesterol, and LDL-C recommended categories in 
multivariable CHD prediction, using a Framingham Heart 



Study sample that pooled information for the original and 
offspring cohorts and followed them for 12 years. This 
approach emphasizes the established, powerful, independent, 
and biologically important factors. Family history for heart 
disease, physical activity, and obesity are not included be- 
cause these factors work to a large extent through the major 
risk factors, and their unique contribution to CHD prediction 
can be difficult to quantify. The prediction of initial CHD 
events in a free-living population not on medication is 
emphasized. Consequently, ERT for postmenopausal women, 
treatment of high blood pressure, and therapy for high blood 
cholesterol are not included in the formulations. 

Methods 

The population-based sample used for this report included 2489 men 
and 2856 women 30 to 74 years old at the time of their Framingham 
Heart Study examination in 1971 to 1974. Participants attended 
either the 11th examination of the original Framingham cohort 11 or 
the initial examination of the Framingham Offspring Study. 12 Similar 
research protocols were used in each study, and persons with overt 
CHD at the baseline examination were excluded. 



From the Framingham Heart Study, National Heart, Lung, and Blood Institute, Framingham, Mass (P.W.F.W., D.L.); Boston University Mathematics 
Department, Boston, Mass (R.B.D., A.M.B., H.S.); and Framingham Heart Study, Boston University School of Medicine, Framingham, Mass (W.B.K.). 
Reprint requests to Dr Peter W.F. Wilson, Framingham Heart Study, National Heart, Lung, and Blood Institute, 5 Thurber St, Framingham, MA 01701. 
E-mail peter@fram.nhlbi.nih.gov Score sheets are on the internet at http://www.nhlbi.nih.gov/nhlbi/fram/ 
© 1998 American Heart Association, Inc. 
Selected Abbreviations and Acronyms 
CHD = coronary heart disease 
ERT = estrogen replacement therapy 
HDL-C = HDL cholesterol 

JNC-V = Fifth Joint National Committee on Hypertension 
LDL-C = LDL cholesterol 
NCEP ATP II = National Cholesterol Education Program, Adult 
Treatment Panel II 
TC = total cholesterol 
VLDL-C = VLDL cholesterol 



At the 1971-1974 examination, a medical history was taken and a 
physical examination was performed by a physician. Persons who 
smoked regularly during the previous 12 months were classified as 
smokers. Height and weight were measured, and body mass index 
(kg/m²) was calculated. Two blood pressure determinations were made 
after the participant had been sitting at least 5 minutes, and the average 
was used for analyses. Hypertension was categorized according to blood 
pressure readings by JNC-V definitions 10 : optimal (systolic 
<120 mm Hg and diastolic <80 mm Hg), normal blood pressure 
(systolic 120 to 129 mm Hg or diastolic 80 to 84 mm Hg), high normal 
blood pressure (systolic 130 to 139 mm Hg or diastolic 85 to 
89 mm Hg), hypertension stage I (systolic 140 to 159 mm Hg or 
diastolic 90 to 99 mm Hg), and hypertension stage II-IV (systolic ≥160 
or diastolic ≥100 mm Hg). When systolic and diastolic pressures fell 
into different categories, the higher category was selected for the 
purposes of classification. Blood pressure categorization was made 
without regard to the use of antihypertensive medication. 
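
The categorization rule above (classify each pressure separately, then take the higher category) can be sketched as follows. This is an illustrative sketch using only the cutoffs quoted in the paragraph; the function and constant names are hypothetical and do not come from the study.

```python
# Category names in increasing order of severity, as listed in the text.
CATEGORIES = [
    "optimal",
    "normal",
    "high normal",
    "hypertension stage I",
    "hypertension stage II-IV",
]

# Lower bounds (mm Hg) at which each successive category begins.
SYSTOLIC_CUTOFFS = (120, 130, 140, 160)
DIASTOLIC_CUTOFFS = (80, 85, 90, 100)


def _rank(value: float, cutoffs: tuple) -> int:
    """Count how many category lower bounds the reading has reached."""
    return sum(value >= cutoff for cutoff in cutoffs)


def jnc_v_category(systolic: float, diastolic: float) -> str:
    """Classify a reading; when the two pressures disagree, the higher category wins."""
    rank = max(_rank(systolic, SYSTOLIC_CUTOFFS), _rank(diastolic, DIASTOLIC_CUTOFFS))
    return CATEGORIES[rank]
```

For example, a reading of 146/88 mm Hg falls in stage I by its systolic value and in high normal by its diastolic value, so it is classified as hypertension stage I.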

Diabetes was considered present if the participant was under treat- 
ment with insulin or oral hypoglycemic agents, if casual blood glucose 
determinations exceeded 150 mg/dL at two clinic visits in the original 
cohort, or if fasting blood glucose exceeded 140 mg/dL at the initial 
examination of the Offspring Study participants. Blood was drawn at the 
baseline examination after an overnight fast, and EDTA plasma was 
used for all cholesterol and triglyceride measurements. Cholesterol was 
determined according to the Abell-Kendall technique, 13 and HDL-C was 
measured after precipitation of VLDL and LDL proteins with heparin- 
magnesium according to the Lipid Research Clinics Program protocol. 14 
When triglycerides were <400 mg/dL, the concentration of LDL-C was 
estimated indirectly by use of the Friedewald formula 15 ; for triglycerides 
≥400 mg/dL, the LDL-C was estimated directly after ultracentrifuga- 
tion of plasma and measurement of cholesterol in the bottom fraction 
(plasma density <1.006). 16 
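
The indirect estimate mentioned above can be sketched with the Friedewald formula as it is commonly stated for values in mg/dL (LDL-C = TC - HDL-C - triglycerides/5). The formula itself is not written out in this report, so its form here is an assumption based on the standard definition, and the function name is hypothetical.

```python
from typing import Optional


def friedewald_ldl_c(total_cholesterol: float, hdl_c: float,
                     triglycerides: float) -> Optional[float]:
    """Estimate LDL-C (mg/dL) indirectly; all inputs are in mg/dL.

    Returns None when triglycerides are 400 mg/dL or higher, the range in
    which the report measured LDL-C directly rather than estimating it.
    """
    if triglycerides >= 400:
        return None
    # Triglycerides / 5 approximates VLDL cholesterol in the standard formula.
    return total_cholesterol - hdl_c - triglycerides / 5.0
```

With a TC of 250 mg/dL, HDL-C of 39 mg/dL, and a hypothetical triglyceride level of 150 mg/dL, the estimate is 250 - 39 - 30 = 181 mg/dL.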

Cutoffs for TC (<200, 200 to 239, 240 to 279, and ≥280 mg/dL), 
LDL-C (<130, 130 to 159, and ≥160 mg/dL), HDL-C (<35, 35 to 
59, and ≥60 mg/dL), cigarette smoking, diabetes, and age were 
considered in this report. The cholesterol and LDL-C cutoffs are 
similar to those used for the NCEP ATP II guidelines and were partly 
dictated by the number of persons with higher levels of TC or 
LDL-C. For those reasons, we have provided information for 
cholesterol categories of 240 to 279 and ≥280 mg/dL and for LDL-C 
≥160 mg/dL. Too few persons had LDL-C ≥190 mg/dL to provide 
stable estimates for CHD risk. Study subjects were followed up over 
a 12-year period for the development of CHD (angina pectoris, 



recognized and unrecognized myocardial infarction, coronary insuf- 
ficiency, and coronary heart disease death) according to previously 
published criteria. "Hard CHD" events included total CHD without 
angina pectoris. 17 Surveillance for CHD consisted of regular exam- 
inations at the Framingham Heart Study clinic and review of medical 
records from outside physician office visits and hospitalizations. 

Statistical tests included age-adjusted linear regression or logistic 
regression to test for trends across blood pressure, TC, LDL-C, and 
HDL-C categories. 18 Age-adjusted Cox proportional hazards regres- 
sion and its accompanying c statistic were used to test for the relation 
between various independent variables and the CHD outcome and to 
evaluate the discriminatory ability of various prediction models. 19,20 
The 12-year follow-up was used in the proportional hazards models, 
and results were adapted to provide 10-year CHD incidence esti- 
mates. Separate score sheets were developed for each sex using TC 
and LDL-C categories. These sheets adapted the results of propor- 
tional hazards regressions by use of a system that assigned points for 
each risk factor based on the value for the corresponding β-coeffi- 
cient of the regression analyses. 

The relative risk, but not the attributable risk, for TC and CHD 
declines with advancing age. 21 Quadratic terms for age were consid- 
ered in the models for the score sheets. Furthermore, CHD risk is 
associated with HDL-C in the elderly, 22-24 and interaction terms for 
TC and age were also considered in the development of the 
prediction models. 22 Among women, an age-squared term was found 
to be significant in the prediction models and was incorporated into 
the score sheets. Neither age×TC nor age×LDL-C was found to be 
significant in either sex. 

Score sheets for prediction of CHD using TC and LDL-C 
categorical variables were developed from the β-coefficients of Cox 
proportional hazards models. The TC range was expanded in 
40-mg/dL increments to include >160 mg/dL and ≥280 mg/dL, the 
HDL-C range 35 to 59 mg/dL was partitioned to provide three levels 
for each sex, and both optimal and normal blood pressure categories 
were included. The score sheets provide comparison 10-year abso- 
lute risks for persons of the same age and sex for average total CHD, 
average hard CHD (total CHD without angina pectoris), and low-risk 
total CHD. Risk factors are shaded, ranging from very low relative 
risk to very high. Such distinctions are arbitrary but provide a 
foundation to determine the need for clinical intervention. 

Results 

At initial examination, study subjects ranged in age from 30 to 
74 years, and the mean age±SD was 48.6±11.7 years for 2489 
men and 49.8±12.0 years for 2856 women. Because there were 
relatively few persons at the higher stages of hypertension in the 
Framingham sample, stages II, III, and IV hypertension were 
combined into a single category in the analyses (Table 1). 
Approximately half of the subjects for each sex had blood 
pressure levels in the normal or optimal range. 

The age-adjusted means for various risk factors according to 
blood pressure categories are shown for men and women in Table 
2. Therapy for hypertension (P<.001 men, P<.001 women), more 
frequent diabetes (P<.001 men, P<.001 women), greater body 



TABLE 1. Characteristics of Participants According to JNC-V 
Hypertension Categories* 

Blood Pressure               Systolic, mm Hg   Diastolic, mm Hg   Men, %   Women, %
Normal (including optimal)   <130              <85                44       55
High normal                  130-139           85-89              20       15
Hypertension stage I         140-159           90-99              23       19
Hypertension stage II-IV     ≥160              ≥100               13       11

*Ignoring blood pressure therapy. 
TABLE 2. Age-Adjusted Mean Levels and Prevalence of Risk Factors According to Blood 
Pressure Category 

                            Not Hypertensive            Hypertensive
                            Normal      High Normal     Stage I     Stage II-IV   P, Test for Trend*
Men                         (n=1097)    (n=500)         (n=567)     (n=325)
  Hypertensive therapy, %   1.6         2.7             10.1        25.0          <.001
  Body mass index, kg/m²    25.8        26.7            27.5        28.3          <.001
  Cigarette use, %          43.1        41.8            35.4        38.2          .010
  Diabetes, %               [illegible] [illegible]     4.0         11.2          <.001
  TC, mg/dL                 210.1       214.3           218.0       213.9         .004
  LDL-C, mg/dL              149.7       143.4           144.5       139.7         .638
  HDL-C, mg/dL              44.4        45.7            44.8        44.5          .674
Women                       (n=1578)    (n=424)         (n=535)     (n=319)
  Hypertensive therapy, %   3.9         9.4             18.0        33.6          <.001
  Body mass index, kg/m²    23.9        25.8            26.3        26.9          <.001
  Cigarette use, %          39.4        37.3            33.9        35.9          .071
  Diabetes, %               2.6         3.4             4.9         9.8           <.001
  TC, mg/dL                 214.1       223.0           224.4       218.5         <.001
  LDL-C, mg/dL              138.3       143.9           146.8       138.9         .031
  HDL-C, mg/dL              58.6        58.2            55.9        55.7          <.001

*Test for linear trend across blood pressure categories after age adjustment. For dichotomous variables, logistic regression was 
done. 



mass index (P<.001 men, P<.001 women), and higher TC level 
(P=.004 men, P<.001 women) were consistently associated 
with higher blood pressure categories in both sexes. Cigarette 
smoking was inversely associated with blood pressure in men 
(P=.010), but only a borderline association was present in 
women (P=.071). The lipoprotein fractions HDL-C 
(P<.001) and LDL-C (P=.031) were significantly associated 
with blood pressure category in women but not in men. 

Age-adjusted 10-year CHD rates for blood pressure and 
cholesterol categories are shown for men and women in Table 3. 
In prediction models, the CHD rates were significantly associ- 
ated with the specified categories of blood pressure, TC, 
HDL-C, and LDL-C (all P<.001 for both sexes). The number of 
CHD events arising at each blood pressure and cholesterol 
category is also given. For blood pressure, the greatest number 
of CHD cases arose from the stage I hypertension category for 
both sexes. Conversely, the greatest number of CHD cases arose 
from the highest lipoprotein cholesterol levels (LDL-C ≥160 
mg/dL or cholesterol ≥240 mg/dL). 

Multivariable risk calculations for TC categories are shown in 
Table 4. Normal or optimal blood pressure was used as the 
reference level, and estimated relative risk rose from 1.00 for normal 
or optimal blood pressure to 1.84 in men and 2.12 in women with 
stage II-IV hypertension. Similarly, for TC, the estimated relative 
risk rose from 1.00 for levels <200 mg/dL to 1.90 in men and 1.72 
in women with TC ≥240 mg/dL. When typical HDL-C levels (35 
to 59 mg/dL) were used as a reference, CHD risk was increased 
among men and women with low HDL-C (<35 mg/dL) and CHD 
risk was correspondingly decreased among subjects with high 
HDL-C (≥60 mg/dL). The population-attributable risk percent 
associated with hypertension was 6% for high normal, 13% for 
stage I, and 9% for stage II-IV hypertension among men. The 
corresponding values were 5% for high normal, 13% for stage I, 



and 12% for stage II-IV hypertension among women. An overall 
estimate of the attributable risk percent for blood pressure level 
greater than normal was 28% in men and 29% in women. When 
cholesterol <200 mg/dL was used as the reference range, attribut- 
able risks were 10% for TC 200 to 239 mg/dL and 17% for TC 
≥240 mg/dL in men and 12% for TC 200 to 239 mg/dL and 22% 
for TC ≥240 mg/dL in women. The overall estimate of the 
attributable risk percent for TC level ≥200 mg/dL was 27% in men 
and 34% in women. 

Multivariable risk calculations for LDL-C categories are 
shown in Table 5, and these results parallel the presentation in 
Table 4. When LDL-C <130 mg/dL is used as the reference 
range, a greater absolute CHD risk is associated with higher 
LDL-C categories, but the magnitude of the relative risk and 
its statistical significance are very similar to that observed for 
the categories of TC (Table 4). 

The efficacy of prediction with continuous variables was 
compared with that obtained with categorical variables and a risk 
factor sum (Figs 1 and 2 for men and women, respectively). For 
calculation of the risk factor sum, the levels considered were age 
(≥45 years for men, ≥55 years for women), hypertension 
(systolic blood pressure ≥140 mm Hg, diastolic blood pressure 
≥90 mm Hg, or use of antihypertensive medication), smoking, 
diabetes, elevated cholesterol (cholesterol ≥240 mg/dL or 
LDL-C ≥160 mg/dL), and HDL-C <35 mg/dL. One point was 
given for each risk factor, for a possible score of 0 to 7 points. 
A greater area under the curve indicated better predictive 
capability. The curves were nearly identical for the continuous 
and categorical formulations, TC and LDL-C categories had 
similar effects, and the risk factor sums tended to have the lowest 
predictive potential. The c statistic, a measure of the discrimi- 
natory ability of a model, equal to the area under the receiver 
operating characteristic curve, provides a guide to interpret the 



Downloaded from circ.ahajournals.org by on October 16, 2007 






TABLE 3. CHD Risk According to Blood Pressure and Lipid Categories 

                              Men                                                  Women
                              Person-   No. of        Age-Adjusted                 Person-   No. of        Age-Adjusted
                              Years     Events (%)    10-Year Rate                 Years     Events (%)    10-Year Rate
Total                         30 154    [illegible]                                38 057    227 (100)
Blood pressure
  Normal (including optimal)  [illegible] [illegible] [illegible]                  20 747    66 (29)       2.9
  High normal                 [illegible] 77 (20)     12.4                         6056      36 (16)       7.1
  Hypertension stage I        6695      115 (30)      16.0                         7254      72 (32)       13.9
  Hypertension stage II-IV    [illegible] [illegible] [illegible]                  4000      53 (23)       14.1
TC, mg/dL
  <200                        [illegible] [illegible] 8.2                          13 289    39 (17)       3.1
  200-239                     [illegible] [illegible] 12.0                         12 683    80 (35)       6.6
  ≥240                        [illegible] [illegible] [illegible]                  [illegible] 108 (48)    [illegible]
HDL-C, mg/dL
  <35                         5601      97 (25)       15.8                         1506      23 (10)       14.7
  35-59                       21 151    260 (68)      12.0                         20 788    146 (64)      7.5
  ≥60                         3409      26 (7)        8.2                          15 761    58 (26)       3.9
LDL-C, mg/dL
  <130                        11 142    104 (27)      7.3                          15 835    50 (22)       2.3
  130-159                     10 384    124 (32)      11.3                         10 455    64 (28)       6.5
  ≥160                        8628      155 (41)      17.3                         11 767    113 (50)      10.6

The age-adjusted 10-year CHD rates were calculated from the Cox proportional hazards model, based on 12 years of follow-up. 



results plotted in Figs 1 and 2. The c statistics associated with TC 
categories were 0.74 in men and 0.77 in women for continuous 
variables by proportional hazards or accelerated failure models, 11 
0.73 in men and 0.76 in women for categorical variables, and 
0.69 in men and 0.72 in women for the risk factor sum. The 



corresponding c statistics associated with LDL-C categories 
were 0.74 in men and 0.77 in women for continuous variables by 
proportional hazards or accelerated failure models, 11 0.73 in men 
and 0.77 in women for categorical variables, and 0.68 in men 
and 0.71 in women for the risk factor sum. 
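
The c statistic reported above can be illustrated as the share of case/non-case pairs in which the case received the higher predicted risk, which equals the area under the receiver operating characteristic curve. The sketch below treats CHD as a simple binary outcome and ignores censoring, so it is a simplification of the Cox-model c statistic used in the study; the function name and sample values are hypothetical.

```python
from typing import Sequence


def c_statistic(scores: Sequence[float], events: Sequence[int]) -> float:
    """Fraction of (case, non-case) pairs ranked correctly; ties count as half."""
    cases = [s for s, e in zip(scores, events) if e]
    noncases = [s for s, e in zip(scores, events) if not e]
    if not cases or not noncases:
        raise ValueError("need at least one case and one non-case")
    concordant = 0.0
    for case_score in cases:
        for noncase_score in noncases:
            if case_score > noncase_score:
                concordant += 1.0
            elif case_score == noncase_score:
                concordant += 0.5
    return concordant / (len(cases) * len(noncases))


# Hypothetical predicted risks and outcomes (1 = developed CHD).
example = c_statistic([0.10, 0.40, 0.35, 0.80], [0, 0, 1, 1])
```

A value of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation of cases from non-cases.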



TABLE 4. Multivariable-Adjusted Relative Risks for CHD According to 
TC Categories 

                              Men                           Women
                              Relative Risk   95% CI        Relative Risk   95% CI
Age, y                        1.05*           1.04-1.06     1.04*           1.03-1.06
Blood pressure
  Normal (including optimal)  1.00            Referent      1.00            Referent
  High normal                 1.31            0.98-1.76     1.30            0.86-1.98
  Hypertension stage I        1.67†           1.28-2.18     1.73*           1.19-2.52
  Hypertension stage II-IV    1.84*           1.37-2.49     2.12*           1.42-3.17
Cigarette use (y/n)           1.68*           1.37-2.06     1.47*           1.12-1.94
Diabetes (y/n)                1.50*           1.06-2.13     1.77*           1.16-2.69
TC, mg/dL
  <200                        1.00            Referent      1.00            Referent
  200-239                     1.31*           1.01-1.68     1.51*           1.01-2.24
  ≥240                        1.90*           1.47-2.47     1.72*           1.15-2.56
HDL-C, mg/dL
  <35                         1.47*           1.16-1.86     2.02*           1.29-3.15
  35-59                       1.00            Referent      1.00            Referent
  ≥60                         0.56*           0.37-0.83     0.58†           0.43-0.79

The multivariate models were performed separately for men and women. Each model included 
simultaneously all variables listed in the table. All analyses used categorical variables. 
*.01<P<.05, †.001<P<.01, ‡P<.001. 
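
A reported relative risk and its 95% CI are enough to approximate the significance band used in the table footnote. The sketch below assumes the interval is symmetric on the log scale (a Wald-type interval), which is the usual form for Cox-model estimates but is not stated in the report; rounding of the published values makes the result approximate, and the function name is hypothetical.

```python
from math import erfc, log, sqrt
from typing import Tuple


def wald_p_from_ci(relative_risk: float, lower: float, upper: float,
                   z_crit: float = 1.959964) -> Tuple[float, float]:
    """Return (z, two-sided P) recovered from a relative risk and its 95% CI."""
    # Standard error of log(RR), from the width of the interval on the log scale.
    standard_error = (log(upper) - log(lower)) / (2.0 * z_crit)
    z = log(relative_risk) / standard_error
    # Two-sided normal tail probability.
    p_value = erfc(abs(z) / sqrt(2.0))
    return z, p_value


# Rows taken from the table for men.
z_diabetes, p_diabetes = wald_p_from_ci(1.50, 1.06, 2.13)
z_high_normal, p_high_normal = wald_p_from_ci(1.31, 0.98, 1.76)
z_tc_240, p_tc_240 = wald_p_from_ci(1.90, 1.47, 2.47)
```

Under these assumptions, diabetes in men falls in the .01<P<.05 band, high normal blood pressure does not reach P<.05, and TC ≥240 mg/dL falls below P<.001, which agrees with the confidence intervals shown.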
TABLE 5. Multivariate-Adjusted Relative Risks for CHD According to 
LDL-C Categories 

                              Men                           Women
                              Relative Risk   95% CI        Relative Risk   95% CI
Age, y                        1.05*           1.04-1.06     1.04*           1.03-1.06
Blood pressure
  Normal (including optimal)  1.00            Referent      1.00            Referent
  High normal                 1.32            0.98-1.78     1.34            0.88-2.05
  Hypertension stage I        1.73‡           1.32-2.26     1.75‡           1.21-2.54
  Hypertension stage II-IV    [illegible]     [illegible]   2.19†           1.46-3.27
Cigarette use (y/n)           1.71‡           1.39-2.10     1.49†           1.13-1.97
Diabetes (y/n)                1.47*           1.04-2.08     1.80†           1.18-2.74
LDL-C, mg/dL
  <130                        1.00            Referent      1.00            Referent
  130-159                     1.19            0.91-1.54     1.24            0.84-1.81
  ≥160                        1.74*           1.36-2.24     1.68†           1.17-2.40
HDL-C, mg/dL
  <35                         1.46†           1.15-1.85     2.08†           1.33-3.25
  35-59                       1.00            Referent      1.00            Referent
  ≥60                         0.61*           0.41-0.91     0.64†           0.47-0.87

The multivariate models were performed separately for men and women. Each model included 
simultaneously all variables listed in the table. All analyses used categorical variables. 
*.01<P<.05, †.001<P<.01, ‡P<.001. 



Score sheets were developed to predict CHD in men (Fig 
3) and women (Fig 4) from the β-coefficients of Cox 
proportional hazards models (Table 6). Among women, an 
age-squared term was found to be significant and was 
incorporated into the score sheets. The average CHD risk 
over a period of 10 years tends to plateau slightly in the oldest 
men and women. 

An illustrative example for Fig 3 follows. The subject is a 
55-year-old man with a TC of 250 mg/dL, HDL-C of 39 
mg/dL, and blood pressure of 146/88 who is diabetic and a 
nonsmoker. Proceeding through the steps gives us the follow- 



ing results: Step 1: Age 55=4 points. Step 2: TC 250 
mg/dL=2 points. Step 3: HDL-C 39 mg/dL=1 point. Step 4: 
Blood pressure 146/88 mm Hg=2 points. Step 5: Diabetic=2 
points. Step 6: Nonsmoker=0 points. Step 7: Point total was 
4+2+1+2+2+0=11. Step 8: Estimated 10-year CHD risk 
is 31%. Step 9: The average and "low-risk" risks of CHD 
over a period of 10 years for a 55-year-old man are 16% and 
7%, respectively (low risk was calculated for a person the 
same age, optimal blood pressure, TC 160 to 199 mg/dL, 
HDL-C 45 mg/dL for men or 55 mg/dL for women, non- 
smoker, and no diabetes). Dividing the subject's risk by the 




Figure 1. Receiver operating characteristic curves 
for prediction of CHD in Framingham men over a 
period of 12 years. Separate plots were used for 
continuous, categorical, and risk factor sum mod- 
els, according to whether TC or calculated LDL-C 
was used. (Horizontal axis: False Positive.) 
Figure 2. Receiver operating charac- 
teristic curves for prediction of CHD in 
Framingham women over a period of 
12 years. Separate plots were used for 
continuous, categorical, and risk factor 
sum models, according to whether TC 
or calculated LDL-C was used. (Horizontal axis: False Positive.) 



average risk provides an estimate of the relative risk: 31% 
divided by 16%= 1.94. Use of the LDL-C approach in the 
score sheets is appropriate when fasting LDL-C estimates are 
available, by use of ultracentrifugation techniques, the 
Friedewald formula, or newer LDL-C assays. 15,25,26 The ap- 
proach is analogous to that shown for TC categories. 
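
The worked example above can be retraced in a few lines. The point values and the 31%, 16%, and 7% risks are the ones quoted in the text for the 55-year-old man; the variable names are hypothetical, and no other entries of the score sheet are assumed.

```python
# Points assigned in Steps 1 to 6 of the illustrative example.
example_points = {
    "age 55 y": 4,
    "TC 250 mg/dL": 2,
    "HDL-C 39 mg/dL": 1,
    "blood pressure 146/88 mm Hg": 2,
    "diabetic": 2,
    "nonsmoker": 0,
}

# Step 7: add up the points.
point_total = sum(example_points.values())

# Steps 8 and 9: risks read from the score sheet, as quoted in the text.
subject_risk = 0.31   # estimated 10-year CHD risk for this point total
average_risk = 0.16   # average 10-year risk for a 55-year-old man
low_risk = 0.07       # "low-risk" 10-year risk for a 55-year-old man

relative_to_average = subject_risk / average_risk
relative_to_low_risk = subject_risk / low_risk

print(point_total, round(relative_to_average, 2))
```

The same division against the low-risk figure (31% divided by 7%) gives a relative risk of roughly 4.4, which the score sheet's comparison step is meant to make visible.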

Discussion 

For the past two decades it has been possible to estimate CHD 
risk by use of regression equations derived from observa- 
tional studies, and the present study demonstrates similar 
results, predicting later CHD in a middle-aged white popula- 
tion sample. Prediction models have typically been based on 
the logistic function, although the Weibull distribution has 
also been used. 11,22 Formulations have often included age, sex, 
blood pressure, TC, HDL-C, smoking, diabetes, and left 
ventricular hypertrophy. 11 The prediction of CHD has taken 
the form of sex-specific equations that were developed from 
a single study and applied to other populations or individuals. 
Age, TC, HDL-C, and blood pressure were used in the 
equations as continuous variables, in contrast to dichotomous 
variables (yes/no) such as smoking, diabetes, and left ven- 
tricular hypertrophy. 

The present study builds on the prior experience of CHD 
prediction with continuous variables and integrates the cate- 
gorical approaches that have become part of the framework of 
blood pressure (JNC-V) and cholesterol (NCEP) programs in 
the United States. 6,7,10 As suggested in an earlier NCEP 
report, 27 our approach integrates blood pressure and choles- 
terol information and estimates both relative and absolute 
CHD risk with a risk factor weighting approach. 

The NCEP ATP II guidelines defined hypertension as a 
yes/no variable, and it can be seen from Tables 3, 4, and 5 that 
additional blood pressure categories are important in predict- 



ing CHD risk. Higher levels of blood pressure are typically 
associated with abnormal cholesterol levels, greater body 
mass index, and an increased prevalence of diabetes (Table 
2). Data from Tables 3 and 4 demonstrate that blood pressure, 
TC, LDL-C, and HDL-C categories are predictive of CHD 
and suggest that risk factor prevention and intervention 
programs should be integrated, as recently suggested. 28-30 
Three reasons probably account for similar results when 
continuous or categorical formulations are used: (1) a large 
enough number of categories has been used to adequately 
describe the clinical data; (2) coronary prediction equations 
have limitations in their precision and accuracy; and (3) in the 
final steps of the prediction score sheet, the data are summa- 
rized, by use of point score totals, providing fewer than 20 
combinations for CHD risk prediction. 

The predictive capability of the continuous model de- 
scribed here is similar to the accelerated failure model used in 
an earlier Framingham CHD prediction equation, 11 and the 
continuous variable and categorical variable approaches have 
c statistic values that are nearly identical, suggesting that 
predictability of the models is nearly the same in either 
instance. This result is in contradistinction to a comparison of 
the NCEP ATP II algorithm (<10 unique patterns) with a 
continuous variable approach in which the latter (using 
Framingham models) was thought to be statistically superi- 
or. 29 A risk factor sum model, considering 7 dichotomous 
variables, was used for comparison in the present study and 
showed a significant falloff in the level of the c statistic with 
this approach compared with formulations using categorical 
or continuous levels. 

TC- and LDL-C- based approaches, whether continuous or 
categorical variables are used, are similar in their ability to 
predict initial CHD events in the models presented. This may 
result from indirect estimation of LDL-C, leading to reduced 



Downloaded from circ.ahajournals.org by on October 16, 2007 



Wilson et al May 12, 1998 1843 



[Figure 3 score sheet; point values are not recoverable from the scan. Panels: age (years); LDL-C or cholesterol (mg/dL and mmol/L); HDL-C (mg/dL and mmol/L); blood pressure (systolic and diastolic, mm Hg); smoker; adding up the points (sum from steps 1-6); determine CHD risk from point total; compare to average person your age; relative risk color key (very low to very high).]

Note: When systolic and diastolic pressures provide different estimates for point scores, use the higher number.

* Hard CHD events exclude angina pectoris.

** Low risk was calculated for a person the same age, optimal blood pressure, LDL-C 100-129 mg/dL or cholesterol 160-199 mg/dL, HDL-C 45 mg/dL for men or 55 mg/dL for women, non-smoker, no diabetes.

Risk estimates were derived from the experience of the Framingham Heart Study, a predominantly Caucasian population in Massachusetts, USA.

Figure 3. CHD score sheet for men using TC or LDL-C categories. Uses age, TC (or LDL-C), HDL-C, blood pressure, diabetes, and 
smoking. Estimates risk for CHD over a period of 1 0 years based on Framingham experience in men 30 to 74 years old at baseline. 
Average risk estimates are based on typical Framingham subjects, and estimates of idealized risk are based on optimal blood pressure, 
TC 160 to 199 mg/dL (or LDL 100 to 129 mg/dL), HDL-C of 45 mg/dL in men, no diabetes, and no smoking. Use of the LDL-C catego- 
ries is appropriate when fasting LDL-C measurements are available. Pts indicates points. 



accuracy and precision of LDL-C estimates from single blood 
measurements. 35,32 The CHD estimates in the present article 
represent the experience of a free-living population sample, 
and different results may be obtained when blood pressure or 
blood cholesterol has been treated aggressively. 

Although the impact of TC and LDL-C on estimates of CHD 
risk is similar in Framingham data, such results may be more 
relevant to populations than to individuals. Extensive clinical 
data and clinical trial results suggest that LDL-C is the major 
atherogenic lipoprotein and that measurement of LDL-C levels 
in the clinical setting provides an advantage. 33-35 High or low 



levels of HDL-C within individuals can produce discrepancies 
between TC and LDL-C levels. In addition, TC and LDL-C 
levels are not always concordant in persons with hypertriglyceridemia.
Thus, measurement of TC is only a crude surrogate for
LDL-C in risk assessment or in estimating initial response to 
therapy, although it can be useful in initial detection or long-term 
monitoring of response. 31 

Several candidate variables were not used in the predic- 
tion equations. A family history of premature CHD, 
previously shown in the Framingham Study to increase the 
relative odds of CHD to approximately 1.3, 36 was not uniformly
[Figure 4 score sheet; point values are not recoverable from the scan. Panels include: adding up the points (sum from steps 1-6); determine CHD risk from point total; compare to average person your age.]

Figure 4. CHD score sheet for women using TC or LDL-C categories. Uses age, TC, HDL-C, blood pressure, diabetes, and smoking.
Estimates risk for CHD over a period of 10 years based on Framingham experience in women 30 to 74 years old at baseline. Average 
risk estimates are based on typical Framingham subjects, and estimates of idealized risk are based on optimal blood pressure, TC 160 
to 199 mg/dL (or LDL 100 to 129 mg/dL), HDL-C of 55 mg/dL in women, no diabetes, and no smoking. Use of the LDL-C categories is 
appropriate when fasting LDL-C measurements are available. Pts indicates points. 



available among the second-generation participants. Fi- 
brinogen is now recognized as a CHD risk factor, 37 and 
levels were available for approximately 1000 original cohort participants
at a 1968-70 examination, 38,39 but fibrinogen measurements
were not available for the Offspring Study
participants. In addition, established methods for measur- 
ing fibrinogen are lacking, and the precise mechanism 
linking elevated fibrinogen levels to CHD is unclear. Other 
risk factors, such as smoking, diabetes, and hypertension, 
are often associated with abnormal fibrinogen levels, and 
fibrinogen measurements vary greatly within individuals. 37,40
Left ventricular hypertrophy on the ECG was used
in previous CHD prediction algorithms, but it is highly 
associated with hypertension and was not included in the 



present formulation for a variety of reasons, including lack 
of standard universally accepted ECG criteria. n 

Postmenopausal ERT was not used in the prediction 
algorithm, because estrogen dose was typically higher in the 
early 1970s 41 and the cardioprotective effects of hormonal 
replacement therapy that have been universally observed in 
more recent times 42-45 were not experienced by all Framingham
women from the early 1970s to the mid 1980s. 46-48

Persons who exercise typically have a lower risk of
CHD. 49-51 Information on physical activity was not available
at the baseline examinations used to develop this CHD risk
prediction algorithm, but cigarette smoking, low HDL-C
levels, and diabetes are less common among those who are
physically active. 52-55 Regular and vigorous exercise is often
associated with higher levels of HDL-C, an important determinant
for reduced CHD risk. 56-58 Similarly, body mass
index, an obesity index that expresses weight in kilograms 
divided by height in meters squared, has been considered a 
candidate variable for the CHD prediction algorithm. Greater 
obesity has been associated with higher TC, lower HDL-C, 
higher blood pressure, and diabetes, and the residual impact 
of obesity on CHD has typically been slight after incorpora- 
tion of these other variables into the regression model. 8 

Clinicians should exercise caution in generalizing from 
experience of the Framingham Study, a community sample of 
white subjects drawn from a suburb west of Boston. Use of 
the prediction models would be most appropriate for individ- 
uals who resemble the study sample. However, reasonable 
accuracy in predicting CHD has been demonstrated in the 
past, when earlier Framingham CHD prediction equations 
were applied to population samples from Honolulu, Puerto 
Rico, Albany, Chicago, Los Angeles, Minneapolis, Tecumseh,
the Western Collaborative Group, and a national cohort. 59-62
Follow-up from the Framingham Study was also
used to estimate CHD experience in men participating in the 
Multiple Risk Factor Intervention Trial. 63 

Coronary prediction estimates tend to be most reliable 
when the data are most concentrated and can be particularly 
useful when subjects have multiple mild abnormalities that 
act synergistically to increase CHD risk. It is uncommon for 
persons to have four or five risk factors, and estimates of 
CHD risk tend to be more precise for individuals with fewer 
risk factors. Score sheet approaches have been used to target 
persons for the primary prevention of coronary disease by use 
of a tabular format called a Sheffield table, in which the 
estimated absolute risk for CHD is used to establish a 
threshold for aggressive intervention. 64 The average CHD 
rates reported in those tables are roughly comparable to the 
myocardial infarction and coronary death rates among mid- 
dle-aged men who participated in the West of Scotland trial 
of cholesterol lowering. 35,65 In contrast, our prediction equa- 
tions estimate coronary disease risk over a period of 10 years 
for a larger age range and include total CHD (angina pectoris, 
myocardial infarction, and coronary death). 

A study that considered CHD prediction using TC, LDL-C, 
TC/HDL-C ratio, and LDL-C/HDL-C ratio 66 concluded that 
"total cholesterol/HDL is a superior measure of risk for CHD 
compared with either total cholesterol or LDL cholesterol, 
and that current practice guidelines could be more efficient if 
risk stratification was based on this ratio rather than primarily 
on the LDL cholesterol level." Such an approach appears 
attractive, but at the extremes of the TC or LDL-C distribu- 
tion, equal ratios may not signify the same CHD risk. 
Moreover, use of a ratio may make it harder for the physician 
to focus on the separate values for TC, LDL-C, and HDL-C 
that have to be borne in mind to make appropriate clinical 
decisions concerning therapy. The current approach builds on 
established blood pressure (JNC-V) and cholesterol (NCEP 
ATP II) foundations, requires fasting samples only if LDL-C 
score sheets are used, and is easy to implement as part of a 
screening program. 

Estimation of CHD and other cardiovascular events is a 
dynamic field. The present formulation has attempted to provide 



TABLE 6. β-Coefficients Underlying CHD Prediction Sheets
Using TC Categories

Variable                                        Men         Women
Age, y                                          0.04826     0.33766
Age squared, y                                              -0.00268
TC, mg/dL
  <160                                         -0.65945    -0.26138
  160-199                                       Referent    Referent
  200-239                                       0.17692     0.20771
  240-279                                       0.50539     0.24385
  ≥280                                          0.65713     0.53513
HDL-C, mg/dL
  <35                                           0.49744     0.84312
  35-44                                         0.24310     0.37796
  45-49                                         Referent    0.19785
  50-59                                        -0.05107     Referent
  ≥60                                          -0.48660    -0.42951
Blood pressure
  Optimal                                      -0.00226    -0.53363
  Normal                                        Referent    Referent
  High normal                                   0.28320    -0.06773
  Stage I hypertension                          0.52168     0.26288
  Stage II-IV hypertension                      0.61859     0.46573
Diabetes                                        0.42839     0.59626
Smoker                                          0.52337     0.29246
Baseline survival function at 10 years, S(t)    0.90015     0.96246



a simplified approach to predict risk for initial CHD events in 
outpatients free of disease, drawing on national programs for 
treatment of elevated blood pressure and TC, without a loss in 
accuracy. Other factors, such as fibrinogen, lipoprotein(a), ERT, 
family history of premature CHD, and hypertensive therapy 
have been or will be evaluated as baseline data and greater 
follow-up experience become available. 

Appendix 
Application of Tables 6 and 7 

The β-coefficients given in Table 6 are used to compute a linear
function. The latter is corrected for the averages of the participants'
risk factors, and the subsequent result is exponentiated and used to
calculate a 10-year probability of CHD after insertion into a survival
function. The following explanation and an example treat each of
these steps in a serial fashion, using Table 6 for the illustration
below.

(Equation 1): L_Chol(men) = 0.04826 × age - 0.65945 (if cholesterol
<160) + 0.0 (if cholesterol 160 to 199) + 0.17692 (if cholesterol 200
to 239) + 0.50539 (if cholesterol 240 to 279) + 0.65713 (if cholesterol
≥280) + 0.49744 (if HDL-C <35) + 0.24310 (if HDL-C 35 to
44) + 0.0 (if HDL-C 45 to 49) - 0.05107 (if HDL-C 50 to 59)
- 0.48660 (if HDL-C ≥60) - 0.00226 (if blood pressure [BP]
optimal) + 0.0 (if BP normal) + 0.28320 (if BP high normal)
+ 0.52168 (if BP stage I hypertension) + 0.61859 (if BP stage II
hypertension) + 0.42839 (if diabetes present) + 0.0 (if diabetes not
present) + 0.52337 (if smoker) + 0.0 (if not smoker).

The function is evaluated at the values of the means for
each variable. Call it G, where (Equation 1): G_Chol(men)
= 0.04826 × 48.5926 - 0.65945 × 0.07433 + 0.17692 × 0.38851
+ 0.50539 × 0.16673 + 0.65713 × 0.05826 + 0.49744 × 0.19285
+ 0.24310 × 0.35476 - 0.05107 × 0.19646 - 0.48660 × 0.10727
- 0.00226 × 0.20048 + 0.28320 × 0.20048 + 0.52168 × 0.22820
+ 0.61859 × 0.13057 + 0.42839 × 0.05223 + 0.52337 × 0.40458
= 3.0975. Similarly, for women, G_Chol = 9.92545. For the LDL score sheets,
G_LDL for men is 3.00069 and for women 9.914136.

This value of G is subtracted from function L to produce function
A (Equation 2), which is then exponentiated, to produce B (Equation
3). The latter represents the relative odds for CHD. The survival
value s(t) is exponentiated by B and subtracted from 1.0 to calculate
the 10-year probability of CHD (Equation 4).

(Equation 2): A = L - G (where G_Chol = 3.0975 for men,
9.92545 for women; similarly for Table 7, G_LDL = 3.00069 for
men, 9.914136 for women).

(Equation 3): B = e^A.

(Equation 4): P = 1 - [s(t)]^B [where s(t)_Chol at 10 years = 0.90015 for
men, 0.96246 for women; similarly for Table 7, s(t)_LDL at 10
years = 0.90017 for men, 0.9628 for women].

Consider a 55-year-old man with cholesterol of 250 mg/dL, HDL-C
of 39 mg/dL, blood pressure (146/88 mm Hg) that falls into stage I
hypertension, and no diabetes, who is a smoker. In this instance, after
Equation 1, L = 55 × 0.04826 + 0.50539 + 0.24310 + 0.52168 + 0.52337
= 4.4478. After Equation 2, A = 4.4478 - 3.0975 = 1.3503, and after
Equation 3, B = e^1.3503 = 3.85874. Finally, after Equation 4,
P = 1 - 0.90015^3.85874 = 1 - 0.66637 = 0.3336, for a 33% chance of developing
CHD over 10 years. According to the point score sheet, 55 years
old (4 points) + cholesterol of 250 mg/dL (2 points) + HDL-C of 39
mg/dL (1 point) + stage I blood pressure (2 points) + smoker (2
points) = 11 points, corresponding to a 31% chance of developing CHD
over 10 years. An average 55-year-old man has a 16% risk, and an ideal
man has a 7% risk. Similar calculations can be done for women and for
the LDL-C prediction models and score sheets.



TABLE 7. β-Coefficients Underlying CHD Prediction Sheets
Using LDL-C Categories

Variable                                        Men         Women
Age, y                                          0.04808     0.33994
Age squared, y                                              -0.0027
LDL-C, mg/dL
  <100                                         -0.69281    -0.42616
  100-129                                       Referent    Referent
  130-159                                       0.00389     0.01366
  160-189                                       0.26755     0.26948
  ≥190                                          0.56705     0.33251
HDL-C, mg/dL
  <35                                           0.48598     0.88121
  35-44                                         0.21643     0.36312
  45-49                                         Referent    0.19247
  50-59                                        -0.04710     Referent
  ≥60                                          -0.34190    -0.35404
Blood pressure
  Optimal                                      -0.02642    -0.51204
  Normal                                        Referent    Referent
  High normal                                   0.30104    -0.03484
  Stage I hypertension                          0.55714     0.28533
  Stage II-IV hypertension                      0.65107     0.50403
Diabetes                                        0.42146     0.61313
Smoker                                          0.54377     0.29737
Baseline survival function at 10 years, S(t)    0.90017     0.9628
Breast cancer exhibits familial aggregation, consistent with variation in genetic susceptibility to the disease. Known
susceptibility genes account for less than 25% of the familial risk of breast cancer, and the residual genetic variance is likely
to be due to variants conferring more moderate risks. To identify further susceptibility alleles, we conducted a two-stage
genome-wide association study in 4,398 breast cancer cases and 4,316 controls, followed by a third stage in which 30 single
nucleotide polymorphisms (SNPs) were tested for confirmation in 21,860 cases and 22,578 controls from 22 studies. We
used 227,876 SNPs that were estimated to correlate with 77% of known common SNPs in Europeans at r^2 > 0.5. SNPs in five
novel independent loci exhibited strong and consistent evidence of association with breast cancer (P < 10^-7). Four of these
contain plausible causative genes (FGFR2, TNRC9, MAP3K1 and LSP1). At the second stage, 1,792 SNPs were significant at the
P < 0.05 level compared with an estimated 1,343 that would be expected by chance, indicating that many additional common
susceptibility alleles may be identifiable by this approach.



Breast cancer is about twice as common in the first-degree relatives of 
women with the disease as in the general population, consistent with 
variation in genetic susceptibility to the disease 1 . In the 1990s, two 
major susceptibility genes for breast cancer, BRCA1 and BRCA2, were 
identified 2,3 . Inherited mutations in these genes lead to a high risk of
breast and other cancers 4 . However, the majority of multiple case
breast cancer families do not segregate mutations in these genes.
Subsequent genetic linkage studies have failed to identify further
major breast cancer genes 5 . These observations have led to the proposal
that breast cancer susceptibility is largely 'polygenic': that is,
susceptibility is conferred by a large number of loci, each with a small
effect on breast cancer risk 6 . This model is consistent with the observed
patterns of familial aggregation of breast cancer 7 . However,

Affiliations of the above authors are given at the end of the paper. 
progress in identifying the relevant loci has been slow. As linkage
studies lack power to detect alleles with moderate effects on risk, large
case-control association studies are required. Such studies have identified
variants in the DNA repair genes CHEK2, ATM, BRIP1 and
PALB2 that confer an approximately twofold risk of breast cancer,
but these variants are rare in the population 8-14 . A recent study has
shown that a common coding variant in CASP8 is associated with a
moderate reduction in breast cancer risk 15 . After accounting for all
the known breast cancer loci, more than 75% of the familial risk of
the disease remains unexplained 16 .

Recent technological advances have provided platforms that allow 
hundreds of thousands of SNPs to be analysed in association studies, 
thus providing a basis for identifying moderate risk alleles without 
prior knowledge of position or function. It has been estimated that
there are 7 million common SNPs in the human genome (with minor
allele frequency, m.a.f., >5%) 17 . However, because recombination
tends to occur at distinct 'hot-spots', neighbouring polymorphisms
are often strongly correlated (in 'linkage disequilibrium', LD) with
each other. The majority of common genetic variants can therefore be
evaluated for association using a few hundred thousand SNPs as tags
for all the other variants 18 . We aimed to identify further breast cancer
susceptibility loci in a three-stage association study 19 . In the first
stage, we used a panel of 266,722 SNPs, selected to tag known common
variants across the entire genome 18 . These SNPs were genotyped
in 408 breast cancer cases and 400 controls from the UK; data were
analysed for 390 cases and 364 controls genotyped for ≥80% of
the SNPs. The cases were selected to have a strong family history of
breast cancer, equivalent to at least two affected female first-degree
relatives, because such cases are more likely to carry susceptibility
alleles 20 . Initially, we analysed 227,876 SNPs (85%) with genotypes on
at least 80% of the subjects. We estimate that these SNPs are correlated
with 58% of common SNPs in the HapMap CEPH/CEU (Utah
residents with ancestry from northern and western Europe) samples
at r^2 > 0.8, and 77% at r^2 > 0.5 (mean r^2 = 0.75; see Supplementary
Fig. 1) (http://www.hapmap.org/) 21 . As expected, coverage was
strongly related to m.a.f.: 70% of SNPs with m.a.f. > 10% were tagged
at r^2 > 0.8, compared with 23% of SNPs with m.a.f. 5-10%. The main
analyses were restricted to 205,586 SNPs that had a call rate of 90%
and whose genotype distributions did not differ from Hardy-Weinberg
equilibrium in controls (at P < 10^-5).
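
A Hardy-Weinberg filter of the kind described above can be sketched as follows. The text does not say which test was used, so this example assumes the common 1 d.f. chi-squared goodness-of-fit test on control genotype counts. The function name and the example counts are hypothetical.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """1 d.f. chi-squared test for Hardy-Weinberg equilibrium.

    Takes counts of the three genotypes at a biallelic SNP and returns
    (chi-squared statistic, P value).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)          # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-squared distribution with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    chi2, p_value = hwe_chi_square(100, 180, 84)   # hypothetical control counts
    keep = p_value >= 1e-5                         # threshold quoted in the text
    print(f"chi2={chi2:.3f}, P={p_value:.3g}, keep SNP: {keep}")
```

An exact test is often preferred when genotype counts are small; the chi-squared form is shown here only because it is short and self-contained.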

For the second stage we selected 12,711 SNPs, approximately 5% of 
those typed in stage 1, on the basis of the significance of the difference
in genotype frequency between cases and controls. These SNPs were 
Figure 1 | Quantile-quantile plots for the test statistics (Cochran-Armitage
1 d.f. χ^2 trend tests) for stages 1 and 2. a, Stage 1; b, stage 2. Black
dots are the uncorrected test statistics. Red dots are the statistics corrected by
the genomic control method (λ = 1.03 for stage 1, λ = 1.06 for stage 2).
Under the null hypothesis of no association at any locus, the points would be
expected to follow the black line.



then genotyped in a further 3,990 invasive breast cancer cases and 
3,916 controls from the SEARCH study, using a custom-designed 
oligonucleotide array. In the main analyses, we considered 10,405 
SNPs with call rate of >95% that did not deviate from Hardy- 
Weinberg equilibrium in controls. 

Comparison of the observed and expected distribution of test statistics
showed some evidence for an inflation of the test statistics in both
stage 1 (inflation factor λ = 1.03, 95% confidence interval (CI) 1.02-1.04)
and stage 2 (λ = 1.06, 95% CI 1.04-1.12), based on the 90% least
significant SNPs (Fig. 1). Possible explanations for this inflation
include population stratification, cryptic relatedness among subjects,
and differential genotype calling between cases and controls. There
was evidence for an excess of low call rate SNPs among the most
significant SNPs (P < 0.01) in stage 1, but not in stage 2, suggesting
that some of this effect is a genotyping artefact (Supplementary Table
1). However, the inflation was still present among SNPs with call rate
>99% in both cases and controls, possibly reflecting population substructure.
We computed 1 degree of freedom (d.f.) association tests for
each SNP, combining stages 1 and 2. After adjustment for this inflation
by the genomic control method 22 , we observed more associations than
would have been expected by chance at P < 0.05 (Table 1). One SNP
(dbSNP rs2981582) was significant at the P < 10^-7 level that has been
proposed as appropriate for genome-wide studies 23 .
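
The genomic control adjustment can be illustrated with a short sketch. It assumes the widely used median-based estimator, in which λ is the median observed 1 d.f. chi-squared statistic divided by the median of the null distribution (about 0.4549). The authors instead describe an estimate based on the 90% least significant SNPs, so this is a stand-in rather than their exact procedure, and all names below are hypothetical.

```python
import math
import statistics

# Median of a chi-squared distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def inflation_factor(chi2_stats):
    """Median-based genomic control inflation factor (lambda)."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN


def genomic_control_adjust(chi2_stats, lam=None):
    """Divide each 1 d.f. statistic by lambda; lambda is floored at 1."""
    if lam is None:
        lam = inflation_factor(chi2_stats)
    lam = max(lam, 1.0)
    return [x / lam for x in chi2_stats]


def p_value_1df(chi2):
    """Upper-tail P value for a 1 d.f. chi-squared statistic."""
    return math.erfc(math.sqrt(chi2 / 2.0))


if __name__ == "__main__":
    stats = [0.02, 0.31, 0.48, 1.9, 7.4]       # hypothetical trend-test statistics
    lam = inflation_factor(stats)
    adjusted = genomic_control_adjust(stats, lam)
    print(f"lambda={lam:.3f}")
    print([f"{p_value_1df(x):.3g}" for x in adjusted])
```

With λ = 1.06, as reported for stage 2, every statistic is deflated by about 6% before its P value is computed, which is why the adjusted counts in Table 1 are lower than the unadjusted ones.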

In the third stage, to establish whether any SNPs were definitely
associated with risk, we tested 30 of the most significant SNPs in 22
additional case-control studies, comprising 21,860 cases of invasive
breast cancer, 988 cases of carcinoma in situ (CIS) and 22,578 controls
(Supplementary Table 2). Six SNPs showed associations in stage 3 that
were significant at P ≤ 10^-5 with effects in the same direction as in
stages 1 and 2 (Table 2, Supplementary Table 3, and Fig. 2). All these
SNPs reached a combined significance level of P < 10^-7 (ranging from
2 × 10^-76 to 3 × 10^-9). Of these six SNPs, five were within genes or
LD blocks containing genes. SNP rs2981582 lies in intron 2 of FGFR2
(also known as CEK3), which encodes the fibroblast growth factor
receptor 2. SNPs rs12443621 and rs8051542 are both located in an
LD block containing the 5' end of TNRC9 (also known as TOX3), a
gene of uncertain function containing a tri-nucleotide repeat motif, as
well as the hypothetical gene, LOC643714. SNP rs889312 lies in an LD
block of approximately 280 kb that contains MAP3K1 (also known as
MEKK), which encodes the signalling protein mitogen-activated protein
kinase kinase kinase 1, in addition to two other genes: MGC33648
and MIER3. SNP rs3817198 lies in intron 10 of LSP1 (also known as
WP43), encoding lymphocyte-specific protein 1, an F-actin bundling
cytoskeletal protein expressed in haematopoietic and endothelial cells.
A further SNP, rs2107425, located just 110 kilobases (kb) from
rs3817198, was also identified (overall P = 0.00002). rs2107425 is
within the H19 gene, an imprinted maternally expressed untranslated
messenger RNA closely involved in regulation of the insulin growth
factor gene, IGF2. In stage 3, however, rs2107425 was only weakly
significant after adjustment for rs3817198 by logistic regression
(P = 0.06). This suggests that the association with breast cancer risk
may be driven by variants in LSP1 rather than in H19. The sixth SNP
reaching a combined P < 10^-7 was rs13281615, which lies on 8q. It is
correlated with SNPs in a 110 kb LD block that contains no known

Table 1 | Number of significant associations after stage 2

Level of significance    Observed    Observed adjusted*    Expected    Ratio
0.01-0.05                1,239       1,162                 934.3       1.24
0.001-0.01               574         517                   347.6       1.49
0.0001-0.001             112         88                    53.3        1.65
0.00001-0.0001           16          12                    7.0         1.71
<0.00001                 15          13                    0.96        13.5
All P < 0.05             1,956       1,792                 1,343.2     1.33

Observed numbers of SNPs associated with breast cancer after stage 2, by level of significance,
before and after adjustment for population stratification, and expected numbers under the null
hypothesis of no association.

* Adjusted for inflation of the test statistic by the genomic control method.
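
The Ratio column in Table 1 is consistent with the adjusted observed count divided by the expected count in each band. The check below is a small sketch with hypothetical names; the numbers are copied from the table.

```python
# (significance band, observed, observed adjusted, expected) from Table 1.
TABLE_1 = [
    ("0.01-0.05", 1239, 1162, 934.3),
    ("0.001-0.01", 574, 517, 347.6),
    ("0.0001-0.001", 112, 88, 53.3),
    ("0.00001-0.0001", 16, 12, 7.0),
    ("<0.00001", 15, 13, 0.96),
]


def excess_ratios(rows):
    """Ratio of adjusted observed to expected associations per band."""
    return {band: adjusted / expected for band, _, adjusted, expected in rows}


def totals(rows):
    """Column totals across all bands (the 'All P < 0.05' row)."""
    observed = sum(r[1] for r in rows)
    adjusted = sum(r[2] for r in rows)
    expected = sum(r[3] for r in rows)
    return observed, adjusted, expected


if __name__ == "__main__":
    for band, ratio in excess_ratios(TABLE_1).items():
        print(f"{band:>16}: {ratio:.2f}")
    obs, adj, exp = totals(TABLE_1)
    print(f"All P < 0.05: {obs} observed, {adj} adjusted, {exp:.1f} expected, "
          f"ratio {adj / exp:.2f}")
```

The excess grows as the threshold tightens, from about 1.2 in the weakest band to more than 13 below 0.00001, which is the pattern expected if a minority of the top-ranked SNPs reflect true associations.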
Table 2 | Summary of results for eleven SNPs selected for stage 3 that showed evidence of an association with breast cancer

[Only part of this table is legible in the scan. The per-allele OR, HetOR and HomOR columns (each with 95% CI), the stages 1 and 2 P-trend column, the remaining m.a.f. values and the remaining rows are not recoverable.]

rs Number     Gene               Position*         m.a.f.†        P-trend, stage 3    P-trend, combined
rs2981582     FGFR2              10q 123342307     0.38 (0.30)    5 × 10^-62          2 × 10^-76
rs12443621    TNRC9/LOC643714    16q 51105538      0.46 (0.60)    9 × 10^-14          2 × 10^-19
rs8051542     TNRC9/LOC643714    16q 51091668      0.44 (0.20)    4 × 10^-8           10^-12
rs889312      MAP3K1             5q 56067641       0.28 (0.54)    3 × 10^-15          7 × 10^-20
rs3817198     LSP1               11p 1865582       0.30 (0.14)    10^-5               3 × 10^-9
rs2107425     H19                11p 1977651                      0.01                2 × 10^-5
rs13281615                       8q 128424800                     6 × 10^-7           5 × 10^-12
genes. The basis of this association therefore remains obscure. This
SNP is approximately 130 kb proximal to rs1447295, 60 kb proximal
to rs6983267 and 230 kb distal to rs16901979, recently shown to be
associated with prostate cancer 24-26 .

In addition to the seven SNPs described above, there was evidence
of association among the remaining 23 SNPs (global P = 0.001 in
stage 3). In particular, three SNPs showed some evidence of association
in stage 3 (P < 0.05, in each case in the same direction as in
stages 1 and 2; Table 2). SNPs rs981782 and rs30099 both lie in the
centromeric region of chromosome 5. rs4666451 lies on 2p, a region
for which some evidence of linkage to breast cancer in families has
been reported 5 . The 20 other SNPs showed no evidence of association
in stage 3 (global P = 0.11), suggesting that most of these associations
from stages 1 and 2 were false positives.



FGFR2 

The most significantly associated SNP, rs2981582, lies within a 25 kb LD
block almost entirely within intron 2 of FGFR2. We found no evidence
of association with SNPs elsewhere in the gene (Fig. 3a). In an attempt to
identify a causal variant, we first identified the 19 common variants
(m.a.f. > 0.05) in this block from HapMap CEU data. These were tagged
(r^2 > 0.8) by 7 SNPs including rs2981582. The additional tag SNPs were
genotyped in the SEARCH study cases and controls. Multiple logistic
regression analysis of these variants found no additional evidence for
association after adjusting for rs2981582. Haplotype analysis of these 7
SNPs indicated that multiple haplotypes carrying the minor (a) allele of
rs2981582 were associated with an increased risk of breast cancer, implying
that the association was being driven by rs2981582 itself or a variant
strongly correlated with it (Supplementary Table 4).




[Figure 2 forest plot, panels a-e. Rows: Stage 1, Stage 2, the individual stage 3 studies, European summary, Asian summary and total. x-axis: per-allele odds ratio, 0.8 to 1.8.]



Figure 2 | Forest plots of the per-allele odds ratios for each of the five SNPs
reaching genome-wide significance. a, rs2981582; b, rs3803662; c, rs889312;
d, rs13281615; and e, rs3817198. The x-axis gives the per-allele odds ratio.
Each row represents one study (see Supplementary Table 2), with summary 
odds ratios for all European and all Asian studies, and all studies combined. 



The area of the square for each study is proportional to the inverse of the 
variance of the estimate. Horizontal lines represent 95% confidence 
intervals. Diamonds represent the summary odds ratios, with 95% 
confidence intervals, based on the stage 3 studies only. 
Resequencing of this region in 45 subjects of European origin
identified 29 variants that were strongly correlated with rs2981582
(r^2 > 0.6) (http://cgwb.nci.nih.gov; Fig. 3b and Supplementary
Tables 5-8). A subset of 14 variants tagged 27 of these in European
(r^2 > 0.95) and Asian (Korean) samples (r^2 > 0.86). Two variants
could not be genotyped reliably. This new tagging set was then genotyped
in SEARCH and 3 studies from Asian populations; the Asian
studies were included because the LD is weaker, providing greater
power to resolve the causal variant (Fig. 3b, left panel). The strongest
association was found with rs7895676. On the assumption that there
is a single disease-causing allele, we calculated a likelihood for each
variant. 21 SNPs (including rs2981582) had a likelihood ratio of <1/100
relative to rs7895676, indicating that none of these are likely to be
the causal variant (Supplementary Table 8). Six variants were too
strongly correlated for their individual effects to be separated using
a genetic epidemiological approach. Functional assays will be
required to determine which is causally related to breast cancer risk.

Intron 2 of FGFR2 shows a high degree of conservation in mam- 
mals, and contains several putative transcription-factor binding sites 
(http://genomequebec.mcgill.ca/PReMod) 27 , some of which lie in 
close proximity to the relevant SNPs. We therefore speculate that 
the association with breast cancer risk is mediated through regulation 
of FGFR2 expression. Of possible relevance is that only three of these 
variants (rsl0736303, rs2981578 and rs35054928) are within 
sequences conserved across all placental mammals (Fig. 3c and 
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Supplementary Table 8). Of these, the disease associated allele of 
rsl0736303 generates a putative oestrogen receptor (ER) binding site. 
rs35054928 lies immediately adjacent to a perfect POU domain pro- 
tein octamer (Oct) binding site. However, multiple splice variants 
have been reported in FGFR2, and differential splicing might provide 
an alternative mechanism for the association. FGFR2 is a receptor 
tyrosine kinase that is amplified and overexpressed in 5-10% of 
breast tumours 28-30. Somatic missense mutations of FGFR2 that are
likely to be implicated in cancer development have also been demon-
strated in primary tumours and cell lines of multiple tumour types
(http://www.sanger.ac.uk/genetics/CGP/cosmic/) 30,31.

TNRC9/LOC643714 locus 

As two SNPs in the TNRC9/LOC643714 locus, rs12443621 and
rs8051542, both showed convincing evidence of association, we further
evaluated this region by genotyping, in the SEARCH set, an additional 
19 SNPs tagging 101 common variants within the entire TNRC9 and 
LOC643714 genes, based on the HapMap CEU data. SNPs tagging the 
coding region of TNRC9 showed no evidence of association. The stron- 
gest association was observed with rs3803662, a synonymous coding 
SNP of LOC643714 that lies 8 kb upstream of TNRC9. This SNP was 
therefore genotyped in the stage 3 set (Table 2). Logistic regression 
analysis indicated that rs3803662 exhibited a stronger association with 
disease than other SNPs, and the associations with other SNPs were 
non-significant after adjustment for rs3803662. These results suggest
that the causal variant is closely correlated with rs3803662. Four SNPs
in the HapMap CEU data (rs17271951, rs1362548, rs3095604 and
rs4784227) that span LOC643714 and the 5' regulatory regions of
TNRC9 are strongly correlated with rs3803662, and it therefore
remains unclear in which gene the causative variant lies. TNRC9 con-
tains a putative HMG (high mobility group) box motif, suggesting that
it might act as a transcription factor.

Figure 3 | The FGFR2 locus. a, Map of the whole FGFR2 gene, viewed
relative to common SNPs on HapMap. The gene is 126 kb long and in
reverse 3'-5' orientation on chromosome 10. Exon positions are
illustrated with respect to the 67 SNPs with m.a.f. > 5% in HapMap CEU
(therefore the map is not to physical scale). Numbered SNPs are those
tested in the genome-wide study. SNPs in black were not significant in
stage 1. Those in red were significant at P < 0.0001 after stage 2.
rs10510097 (orange) was significant in stage 1, but failed quality
control in stage 2 owing to deviation from Hardy-Weinberg equilibrium.
Squares indicate pairwise r² on a greyscale (black = 1, white = 0). Red
circle indicates rs2981582. b, Resequenced 32 kb region, shown relative
to SNPs in CEU with m.a.f. > 5%, showing pairwise LD for SNPs in
HapMap CEU (left panel) and JPT/CHB (right panel). Red circle indicates
rs2981582, shown in bold black. c, Sequence conservation of 32 kb
region in five species, relative to human sequence
(http://pipeline.lbl.gov/methods.shtml) 35. Red circle indicates
rs2981582. SNPs in grey are those used in the initial tagging of known
common HapMap SNPs within the block. SNPs in black are correlated with
rs2981582 with r² > 0.6 in European samples. Six SNPs in red were those
consistent with being the causative variant on the basis of the genetic
data (not excluded at odds of 100:1 relative to the SNP with the
strongest association, rs7895676).

Pattern of risks 

We assessed in more detail, in the stage 3 data, the pattern of the 
risks associated with the five independent SNPs that reached an over-
all P < 10⁻⁷: rs2981582 (FGFR2), rs3803662 (TNRC9/LOC643714),
rs889312 (MAP3K1), rs13281615 (8q) and rs3817198 (LSP1). For each
of these five SNPs, the minor allele in Europeans was associated with an 
increased risk of breast cancer in a dose-dependent manner, with a 
higher risk of breast cancer in homozygous than in heterozygous car- 
riers. Simple dominant and recessive models could be rejected for each 
SNP (all P=0.02 or less). There was a marked difference in allele 
frequencies between populations, with the risk-associated alleles of 
rs8051542, rs889312 and rs13281615 being the major allele in Asian
populations. The per allele odds ratio associated with rs2981582 was 
significantly smaller, though still elevated, in the Asian versus European 
populations (P= 0.04 for difference in odds ratio). This difference is 
consistent with the hypothesis that rs2981582 is not the functional 
variant at the FGFR2 locus, and was not seen for SNPs exhibiting stron- 
ger evidence in the fine-scale mapping. No other evidence for hetero- 
geneity in the per-allele odds ratio among studies was observed (Fig. 2). 

Three of the SNPs (rs2981582, rs3803662 and rs889312) also 
showed evidence of association with breast CIS (Supplementary 
Table 9). For rs2981582 and rs3803662, the estimated odds ratios were
greater for a diagnosis of breast cancer before age 40 years, but the 
trends by age were not statistically significant (Supplementary Table 
10). There was evidence of an association with family history of breast 
cancer for three SNPs: for rs2981582 (P= 0.02), rs3803662 (P= 0.03) 
and rs13281615 (P = 0.05), the susceptibility allele was commoner in
women with a first-degree relative with the disease than in those 
without (Supplementary Table 11). rs2981582 was also associated 
with bilaterality (P= 0.02). The associations with family history and 
bilaterality are to be expected for susceptibility loci, and are similar to 
previous observations for alleles in CHEK2 and ATM (refs 10, 12, 14). 

Discussion 

This study has identified five novel breast cancer susceptibility loci, 
and demonstrated conclusively that some of the variation in breast 
cancer risk is due to common alleles. None of the loci we identified 
had been previously reported in association studies. Most previously 
identified breast cancer susceptibility genes are involved in DNA 
repair, and many association studies in breast cancer have concen- 
trated on genes in DNA repair and sex hormone synthesis and meta- 
bolism pathways. None of the associations reported here appear to 
relate to genes in these pathways. It is notable that three of the five loci 
contain genes related to control of cell growth or to cell signalling, but 
only one (FGFR2) had a clear prior relevance to breast cancer. These 
results should, therefore, open up new avenues for basic research. 

Our results emphasize the critical importance of study size in gen- 
etic association studies. It is notable that none of the confirmed asso- 
ciations reached genome-wide significance after stage 1 and only one 
reached this level after stage 2. As most common cancers have similar 
familial relative risks to breast cancer, it is likely that similarly large 
studies will be required to identify common alleles for other cancers. 
The fine-scale mapping of the FGFR2 locus demonstrates that, even 
with a clear association, identification of the causative variant can be 
extremely problematic. However, the use of studies from multiple 
populations with different patterns of LD can substantially reduce 
the number of variants that need to be subjected to functional analysis. 

As these susceptibility alleles are very common, a high proportion of 
the general population are carriers of at-risk genotypes. For example,
approximately 14% of the UK population and 19% of UK breast
cancer cases are homozygous for the rare allele at rs2981582. On the
other hand, the increased risks associated with these alleles are rela-
tively small; on the basis of UK population rates, the estimated breast
cancer risk by age 70 years for rare homozygotes at rs2981582 is 10.5%,
compared to 6.7% in heterozygotes and 5.5% in common homozy-
gotes. At this stage, it is unlikely that these SNPs will be appropriate for
predictive genetic testing, either alone or in combination with each 
other. However, as further susceptibility alleles are identified, a com- 
bination of such alleles together with other breast cancer risk factors 
may become sufficiently predictive to be important clinically. 

On the basis of the relative risk estimates from stage 3, and assuming 
that the five most significant loci interact multiplicatively on disease 
risk, these loci explain an estimated 3.6% of the excess familial risk of 
breast cancer. On the basis of our staged design and the estimated 
distribution of linkage disequilibrium between the typed SNPs and 
those in HapMap, we estimate that the power to identify the five most 
significant associations at P < 10⁻⁷ (rs2981582, rs3803662, rs889312,
rs13281615 and rs3817198) was 93%, 71%, 25%, 3% and 1% respect-
ively. These estimates are uncertain, notably because the true coverage
of HapMap SNPs is unknown. Nevertheless, these calculations indicate 
that the power to detect the two strongest associations was high, and 
suggest that there are likely to be few other common variants with a 
similar effect on variation in breast cancer risk to rs2981582. In con-
trast, the low power to detect rs13281615 and rs3817198 suggests that
these variants may represent a much larger class of loci, each explaining 
of the order of 0.1% of the familial risk of breast cancer. An example of 
such a locus is provided by CASP8 D302H, which showed strong 
evidence of association in a previous large study 15 . This SNP was tested 
in stage 1, but the association was missed because it did not reach the
threshold for testing in stage 2. The excess of associations after stage 2 is 
also consistent with the existence of many such loci. In addition, 
because the coverage for SNPs with m.a.f. < 10% was low, many low 
frequency alleles may have been missed. The detection of further sus- 
ceptibility loci will require genome-wide studies with more complete 
coverage and using larger numbers of cases and controls, together with 
the combination of results across multiple studies. The present study 
demonstrates that common susceptibility loci can be reliably iden- 
tified, and that they may together explain an appreciable fraction of 
the genetic variance in breast cancer risk. 

METHODS SUMMARY 

Cases for stage 1 were identified through clinical genetics centres in the UK and a 
national study of bilateral breast cancer. Cases in stage 2 were drawn from a 
population-based study of breast cancer (SEARCH) 32 . Controls for stages 2 and 3 
were drawn from EPIC-Norfolk, a population-based study of diet and cancer 33 . 

Cases and controls for stage 3 were identified through case-control studies in 
Europe, North America, South-East Asia and Australia participating in the 
Breast Cancer Association Consortium (Supplementary Table 2) 3 \ 

Genotyping for stages 1 and 2 was conducted using high-density oligonucleo- 
tide microarrays. For the main analyses, we excluded samples called on <80% of 
SNPs in either stage. We also excluded SNPs that achieved a call rate of <90% in
stage 1 and <95% in stage 2, and SNPs whose frequency deviated from Hardy-
Weinberg equilibrium in controls at P < 0.00001. Genotyping for stage 3, and for
the fine-scale mapping of the FGFR2 locus, was conducted using either a 5'
nuclease assay (Taqman, Applied Biosystems) or MALDI-TOF mass spectro- 
metry using the Sequenom iPLEX system. For each centre, we excluded any 
sample called on <80% of SNPs, and any SNP with a call rate of <95% or a
deviation from Hardy-Weinberg equilibrium in controls at P < 0.00001. Tests
of association were 1 d.f. Cochran-Armitage tests, stratified for stage, centre and 
ethnic group (European or Asian). Odds ratios for each SNP were estimated 
using stratified logistic regression, using the stage 3 data only. 
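
As an illustration of the 1 d.f. Cochran-Armitage trend test named above, the following Python sketch computes the score statistic and its variance within each stratum and sums them across strata. It assumes additive allele-dose scores of 0, 1 and 2 and applies no stratum weights; the function names and all genotype counts used with it are hypothetical, and it is not the analysis code used in the study.

```python
import math

def trend_score(cases, controls, scores=(0, 1, 2)):
    """Return (U, V) for one stratum: score statistic and its variance.

    cases, controls: genotype counts ordered by number of test alleles.
    U is the observed minus expected allele-dose total among cases.
    """
    totals = [r + s for r, s in zip(cases, controls)]
    n = sum(totals)
    n_cases = sum(cases)
    n_controls = n - n_cases
    sum_xn = sum(x * t for x, t in zip(scores, totals))
    sum_x2n = sum(x * x * t for x, t in zip(scores, totals))
    u = sum(x * r for x, r in zip(scores, cases)) - n_cases * sum_xn / n
    v = n_cases * n_controls * (n * sum_x2n - sum_xn ** 2) / n ** 3
    return u, v

def stratified_trend_test(strata):
    """strata: iterable of (cases, controls) genotype-count triples."""
    u_total = 0.0
    v_total = 0.0
    for cases, controls in strata:
        u, v = trend_score(cases, controls)
        u_total += u
        v_total += v
    chi2 = u_total ** 2 / v_total
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # upper tail of chi-square, 1 d.f.
    return chi2, p_value

# Hypothetical counts (common homozygote, heterozygote, rare homozygote).
chi2_one, p_one = stratified_trend_test([([20, 50, 30], [30, 50, 20])])
```

Summing U and V over strata before forming the ratio is what makes the test stratified: each centre contributes its own expected allele dose, so differences in allele frequency between centres do not produce a spurious association.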

Full Methods and any associated references are available in the online version of 
the paper at www.nature.com/nature.

METHODS

Subjects. Cases in stage 1 were identified through clinical genetics centres in 
Cambridge (n = 91), Manchester (96) and Southampton (136), and a national 
study of bilateral breast cancer (85). Cases were women diagnosed with invasive 
breast cancer under the age of 60 years who had a family history score of at least 2, 
where the score was computed as the total number of first-degree relatives plus 
half the number of second-degree relatives affected with breast cancer. The score 
for women with bilateral breast cancer was increased by 1, so that women were 
eligible if they were diagnosed with bilateral breast cancer and had one affected 
first-degree relative. Cases known to carry a BRCA1 or BRCA2 mutation were 
excluded. Controls were selected from the EPIC-Norfolk study, a population- 
based cohort study of diet and cancer based in Norfolk, East Anglia, UK 33 . 
Controls were chosen to be women aged over 50 years and free of cancer at 
the time of entry. Genotyping was attempted on 408 cases, plus 32 duplicate 
case samples, and 400 controls. For the analysis in Table 1, 54 samples with 
genotype call rates <80% were excluded, so the final analyses were based on
390 cases and 364 controls. The minimum genotype call rate for the remaining 
samples was 89%. The overall genotype discordance rate between duplicate 
samples in stage 1 was 0.01%. 
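
The stage 1 case eligibility rule described in this paragraph can be written as a short Python sketch. The function and parameter names are hypothetical; the thresholds (diagnosis under 60 years, family history score of at least 2, an increment of 1 for bilateral disease, exclusion of known BRCA1 or BRCA2 carriers) are those stated in the text.

```python
def family_history_score(first_degree, second_degree, bilateral=False):
    """First-degree relatives count 1 each, second-degree relatives 0.5 each.

    The score is increased by 1 for women with bilateral breast cancer.
    """
    score = first_degree + 0.5 * second_degree
    if bilateral:
        score += 1
    return score

def eligible_for_stage1(age_at_diagnosis, first_degree, second_degree,
                        bilateral=False, brca_carrier=False):
    """Apply the stage 1 case criteria as described in the text."""
    if brca_carrier or age_at_diagnosis >= 60:
        return False
    return family_history_score(first_degree, second_degree, bilateral) >= 2
```

For example, a woman diagnosed at 45 with bilateral disease and one affected first-degree relative scores 2 and is eligible, whereas one affected first-degree and one affected second-degree relative without bilateral disease gives a score of 1.5.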

For stage 2, invasive breast cancer cases were drawn from SEARCH, a popu- 
lation-based study of cancer in East Anglia 32 . Controls were women selected 
from the EPIC-Norfolk study, as previously described 33 . Eighty-eight subjects 
who were also genotyped in stage 1, and 35 controls who subsequently developed 
breast cancer and were also in the case series, were excluded from the analysis, 
leaving 3,990 breast cancer cases and 3,916 controls, plus five duplicates. The 
overall rate of discordance of genotypes between duplicate samples in stage 2 was 
0.008%. 

Twenty-one additional studies were included in stage 3 (see Supplementary 
Table 2). These studies participated through the Breast Cancer Association 
Consortium, an ongoing collaboration among investigators conducting case- 
control association studies in breast cancer 15 - 33 . All studies provided information 
on disease status (invasive breast cancer, carcinoma in situ or control), age at 
diagnosis/observation, ethnic group, first-degree family history of breast cancer 
and bilaterality of breast cancer. One further study (Breast Cancer Study of 
Taiwan) was included in the fine-scale mapping of the FGFR2 locus. 
Genotyping. For stage 1, genotyping was performed on 200 ng DNA that was 
first subjected to whole genome amplification using Multiple Displacement 
Amplification (MDA) 36 . Samples were then genotyped for a set of 266,732 
SNPs using high-density oligonucleotide, photolithographic microarrays at 
Perlegen Sciences. For stage 2, genotyping was performed using 2.5 µg genomic
DNA. These samples were genotyped for a set of 13,023 SNPs selected on the 
basis of the stage 1 results, using a custom designed oligonucleotide array. For 
both stages, each SNP was interrogated by 24 25-mer oligonucleotide probes 
synthesized by photolithography on a glass substrate. The 24 features comprise 4 
sets of 6 features interrogating the neighbourhoods of SNP reference and alterna-
tive alleles on forward and reverse strands. Each allele and strand is represented
by five offsets: -2, -1, 0, 1 and 2 indicating the position of the SNP within the
25-mer, with zero being at the thirteenth base. At offset 0 a quartet was tiled, 
which included the perfect match to reference and alternative SNP alleles, and 
the two remaining nucleotides as mismatch probes. When possible, the mis- 
match features were selected as a purine nucleotide substitution for a purine 
perfect match nucleotide and a pyrimidine nucleotide substitution for a pyri- 
midine perfect match nucleotide. Thus, each strand and allele tiling consisted of 
6 features comprising five perfect match probes and one mismatch. 

Individual genotypes were determined by clustering all SNP scans in the two- 
dimensional space defined by reference and alternative trimmed mean intens- 
ities, corrected for background. Allele frequencies were approximated using the 
intensities collected from the high-density oligonucleotide arrays. An SNP's 
allele frequency, p, was estimated as the ratio of the relative amount of the 
DNA with reference allele to the total amount of DNA. The p value was com- 
puted from the trimmed mean intensities of perfect match features, after sub- 
tracting a measure of background computed from trimmed means of intensities 
of mismatch features. The trimmed mean disregarded the highest and the lowest 
intensity from the five perfect match intensities before computing the arithmetic 
mean. For the mismatch features, the trimmed mean is the individual intensity of 
the specified mismatch feature. 
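
The intensity summary described above can be sketched in Python as follows. The sketch handles a single strand, takes the background for each allele to be its one mismatch intensity, and clamps negative background-corrected signals to zero; the clamping, the single-strand simplification and all names are assumptions made for illustration, since the text does not say how strands are combined or how negative values are treated.

```python
def trimmed_mean(pm_intensities):
    """Mean of five perfect-match intensities after dropping the highest and lowest."""
    if len(pm_intensities) != 5:
        raise ValueError("expected five perfect-match intensities")
    ordered = sorted(pm_intensities)
    return sum(ordered[1:-1]) / 3.0

def allele_signal(pm_intensities, mm_intensity):
    """Background-corrected signal for one allele on one strand."""
    # Clamping at zero is an assumption, not stated in the text.
    return max(trimmed_mean(pm_intensities) - mm_intensity, 0.0)

def reference_allele_fraction(ref_pm, ref_mm, alt_pm, alt_mm):
    """Estimate p as reference signal divided by total signal."""
    ref = allele_signal(ref_pm, ref_mm)
    alt = allele_signal(alt_pm, alt_mm)
    total = ref + alt
    if total == 0:
        return None  # no usable signal for this scan
    return ref / total
```

Dropping the two extreme probes before averaging makes the estimate robust to a single saturated or failed feature among the five offsets.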

The genotype clustering procedure was an iterative algorithm developed as a 
combination of K-means and constrained multiple linear regressions. The 
K-means at each step re-evaluated the cluster membership representing distinct 
diploid genotypes. The multiple linear regressions minimized the variance in p 
within each cluster while optimizing the regression lines' common intersect. The 
common intersect defined a measure of common background that was used to 
adjust the allele frequencies for the next step of K-means. The K-means and 
multiple linear regression steps were iterated until the cluster membership and 



background estimates converged. The best number of clusters was selected by 
maximizing the total likelihood over the possible cluster counts of 1, 2 and 3 
(representing the combinations of the three possible diploid genotypes). The 
total likelihood was composed of data likelihood and model likelihood. The data 
likelihood was determined using a normal mixture model for the distribution of 
p around the cluster means. The model likelihood was calculated using a prior 
distribution of expected cluster positions, resulting in optimal p positions of 0.8 
for the homozygous reference cluster, 0.5 for the heterozygous cluster and 0.2 for 
the homozygous alternative cluster. 

A genotyping quality metric was compiled for each genotype from 15 input 
metrics that described the quality of the SNP and the genotype. The genotyping 
quality metric correlated with a probability of having a discordant call between
the Perlegen platform and outside genotyping platforms (that is, non-Perlegen
HapMap project genotypes). A system of 10 bootstrap aggregated regression
trees was trained using an independent data set of concordance data between
Perlegen genotypes and HapMap project genotypes. The trained predictor was 
then used to predict the genotyping quality for each of the genotypes in this data 
set. Genotypes with quality scores of less than 7 were discarded. Data were 
analysed for 227,876 SNPs in stage 1 and 12,026 (of 13,023 selected) in stage 
2, for which the call rate was >80%. 

The 12,711 SNPs for stage 2 were primarily selected on the basis of a 1 d.f.
Cochran-Armitage trend test (11,809, all with P < 0.052). We also included 826
SNPs with P < 0.01 testing for the difference in frequency of either homozygote
between cases and controls (that is, assuming either a dominant or recessive
model) and 76 SNPs that achieved P < 0.01 on a Cochran-Armitage test, weight-
ing individuals by their family history score as above.

For the main analyses, we discarded SNPs with a call rate <90% in stage 1 and
<95% in stage 2, and SNPs with a deviation from Hardy-Weinberg equilibrium
significant at P< 0.00001 in either stage, leaving 205,586 SNPs in stage 1 and 
10,621 SNPs in stage 2. 

The 30 SNPs included in the stage 3 analyses were initially selected on the basis 
of a combined analysis of stage 1 and stage 2. We included all SNPs achieving a 
combined P< 0.00002 (based on either the Cochran-Armitage or 2 d.f. test, see 
below). Following re-evaluation of the stage 2 genotyping by 5' nuclease assay 
(Taqman, Applied Biosystems) using the ABI PRISM 7900HT (Applied
Biosystems), and exclusion of some samples, 16 of these SNPs were significant
at P < 0.00002 and 24 at P < 0.0002 (Supplementary Table 3). One additional
SNP, rs3803662, was added as a result of fine-scale mapping of the TNRC9/
LOC643714 locus.

The 31 stage 3 SNPs were genotyped in 22 studies (including cases and con- 
trols from SEARCH not used in stage 2, together with 21 other studies). For 18 of 
the studies, genotyping was performed by 5' nuclease assay (Taqman) using the 
ABI PRISM 7900HT or 7500 Sequence Detection Systems according to manu- 
facturer's instructions. Primers and probes were supplied directly by Applied 
Biosystems (http://www.appliedbiosystems.com/) as Assays-by-Design. All 
assays were carried out in 384-well or 96-well format, with each plate including
negative controls (with no DNA). Duplicate genotypes were provided for at least 
2% of samples in each study. For three studies, SNPs were genotyped using 
matrix assisted laser desorption/ionization time of flight mass spectrometry 
(MALDI-TOF MS) for the determination of allele-specific primer extension
products using Sequenom's MassARRAY system and iPLEX technology. The 
design of oligonucleotides was carried out according to the guidelines of 
Sequenom and performed using MassARRAY Assay Design software (version 
1.0). Multiplex PCR amplification of amplicons containing SNPs of interest was 
performed using Qiagen HotStart Taq Polymerase on a Perkin Elmer GeneAmp 
2400 thermal cycler (MJ Research) with 5 ng genomic DNA. Primer extension
reactions were carried out according to manufacturer's instructions for iPLEX 
chemistry. Assay data were analysed using Sequenom TYPER software (version 
3.0). One study used both the Taqman and MALDI-TOF MS approaches. The
SNPs genotyped in stage 3 were also regenotyped in the stage 2 samples using 
Taqman; these genotype calls were used in the overall analyses (Table 2, 
Supplementary Table 3, and Fig. 2). 

We eliminated any sample that could not be scored on 20% of the SNPs 
attempted. We also removed data for any centre/SNP combination for which 
the call rate was less than 90%. In any instances where the call rate was 90-95%, 
the clustering of genotype calls was re-evaluated by an independent observer to 
determine whether the clustering was sufficiently clear for inclusion. We also 
eliminated all the data for a given SNP/centre where the reproducibility in 
duplicate samples was <97%, or where there was marked deviation from 
Hardy-Weinberg equilibrium in the controls (P < 0.00001).
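
The per-centre quality control rules above can be sketched in Python. The thresholds (call rate below 90% excluded, 90-95% referred for re-evaluation, duplicate reproducibility below 97% excluded, Hardy-Weinberg P < 0.00001 excluded) come from the text. The text does not state which Hardy-Weinberg test was used, so the 1 d.f. chi-square goodness-of-fit test below is an assumption, as are all function names and return labels.

```python
import math

def hwe_chi2_p_value(n_aa, n_ab, n_bb):
    """P value from a 1 d.f. chi-square test of Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0 or q == 0:
        return 1.0  # monomorphic: no departure can be tested
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return math.erfc(math.sqrt(chi2 / 2.0))

def qc_decision(call_rate, duplicate_concordance, control_genotype_counts):
    """Return 'exclude', 'review' or 'keep' for one SNP/centre combination."""
    if call_rate < 0.90 or duplicate_concordance < 0.97:
        return "exclude"
    if hwe_chi2_p_value(*control_genotype_counts) < 1e-5:
        return "exclude"
    if call_rate < 0.95:
        return "review"  # clustering re-evaluated by an independent observer
    return "keep"
```

An exact test would be a reasonable substitute for the chi-square approximation when genotype counts are small.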
Fine-scale mapping of FGFR2. Initial tagging of the associated region was done 
by identifying all SNPs with an m.a.f.>5% in the HapMap CEPH/CEU set 
(Utah residents with ancestry from northern and western Europe). We then 
selected 7 SNPs (in addition to rs2981582) that tagged these variants with a
pairwise r² > 0.8, using the program Tagger (http://www.broad.mit.edu/mpg/tagger/) 37.
To identify additional common variants within the 32.5 kb region of
linkage around the associated SNP, we resequenced 45 lymphocyte DNA samples 
from a subset of European subjects also genotyped by HapMap and other pub- 
licly available data sets. Seventy overlapping PCR amplicons were designed from 
positions 123317613 to 123348192 of chromosome 10 (average amplicon size 
650 bp, 160 bp overlap). M13-tagged PCR products were bidirectionally 
sequenced using Big Dye 3.0 (Applied Biosystems) and processed using auto-
mated trace analysis through the Cancer Genome Workbench (cgwb.nci.nih.gov).
Eighty-six per cent of the nucleotides across the region could be scored for
polymorphisms in at least 80% of subjects. This set gave a >97% probability of
detecting a variant with an m.a.f. > 5%. One hundred and seventeen variants
were identified, including 27 present in dbSNP but without individual genotype 
information in European subjects, and an additional 46 not in dbSNP. 
Individual genotype information was then compared and merged with publicly 
available genotypes from Caucasian subjects (HapMap release 21 for 60 CEU 
parents, 22 European subjects from the Environmental Genome Project (EGP) 
resequencing effort (http://egp.gs.washington.edu/data/fgfr2/), and 24 Euro- 
pean subjects from Perlegen (retrieved through http://gvs.gs.washington.edu/ 
GVS)). There were 2 discrepancies among 389 genotype calls among subjects 
in common between our resequencing effort and EGP or Perlegen data, and 10 
out of 926 compared to HapMap genotypes. 

On the basis of these data, we identified 28 SNPs correlated with rs2981582 
with r² > 0.6. We then attempted to genotype these 28 SNPs, plus rs2981582, in a
subset of 80 controls from SEARCH and 84 controls from the Seoul Breast 
Cancer Study. Twenty-two of the variants were genotyped using Taqman. 
Four further variants (rs34032268, rs2912778, rs2912781 and rs7895676), which 
were not amenable to Taqman, were genotyped by Pyrosequencing (Biotage;
http://www.biotagebio.com/). Assays were designed using Pyrosequencing 
Assay Design Software 1.0. The remaining 2 SNPs (rs35393331 and 
rs33971856) could not be genotyped using either technology and were excluded 
from further analyses. We cannot therefore comment on their likelihood of being 
the causal variant. Using these data, we selected tagging sets of 11 SNPs for UK
subjects and 14 SNPs for Korean subjects (including rs2981582), such that each
of the remaining variants was correlated with a tagging SNP with r² > 0.95 in the
UK study or r² > 0.86 in the Korean study. After genotyping the 11 tag SNPs in
SEARCH, two of these SNPs (rs4752569 and rs35012336) showed strong evid-
ence against being the causative variant and were not considered further. The
remaining 12 tag SNPs from the Korean subset were then genotyped in the
samples from the IARC-Thai Breast Cancer Study, the Breast Cancer Study in
Taiwan and the Multi-Ethnic Cohort (MEC), by Taqman. 
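
The idea of choosing a tagging set from pairwise r² values can be illustrated with a simple greedy sketch in Python. This does not reproduce the algorithm of the Tagger program; it approximates r² by the squared Pearson correlation of genotype dosages (0, 1 or 2 copies of an allele) and repeatedly picks the SNP that covers the most untagged SNPs. The function names, threshold default and toy genotypes are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP is uncorrelated with everything
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)

def greedy_tags(genotypes, threshold=0.8):
    """Pick tag SNPs so every SNP has r-squared > threshold with some tag.

    genotypes: dict mapping SNP label -> list of dosages across subjects.
    """
    snps = list(genotypes)
    covers = {
        s: {s} | {t for t in snps
                  if r_squared(genotypes[s], genotypes[t]) > threshold}
        for s in snps
    }
    untagged = set(snps)
    tags = []
    while untagged:
        # Choose the SNP that captures the most SNPs not yet tagged.
        best = max(snps, key=lambda s: len(covers[s] & untagged))
        tags.append(best)
        untagged -= covers[best]
    return tags

# Toy dosages for illustration only.
toy = {
    "a": [0, 1, 2, 0, 1, 2],
    "b": [0, 1, 2, 0, 1, 2],
    "c": [2, 1, 0, 2, 1, 0],
    "d": [0, 0, 1, 2, 2, 1],
}
tags = greedy_tags(toy)
```

A greedy choice is not guaranteed to give the smallest possible tagging set, but it is simple and usually close to it for a region of this size.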
Statistical methods. The primary test used for each SNP was a Cochran- 
Armitage 1 d.f. score test for association between disease status and allele dose. 
In the combined analysis, we performed a stratified Cochran-Armitage test.
Stage 1 was given a weight of 4 in this analysis (corresponding to a weight of 2
in the score statistic), to allow for the expected greater effect size given the
inclusion of cases with a family history. In the stage 3 analyses, each study was
treated as a separate stratum, except for the MEC, in which the European 
American and Japanese American subgroups were treated as separate strata. 
For all studies except the MEC, individuals from a minor ethnic group for that 
study were excluded. Per-allele and genotype-specific odds ratios, and confid- 
ence intervals, were estimated using logistic regression, adjusting for the same 
strata. The summary odds ratios in Fig. 2 are based on the data from the stage 3 
studies only, to avoid the bias inherent in estimates from the stage 1 and 2 data 
for SNPs exhibiting an association (the so called 'winner's curse'). The effects of 
genotype on family history of breast cancer (first degree yes/no) and bilaterality 
were examined by treating these variables as outcomes in a stratified Cochran- 
Armitage test. 

To assess the global significance of the SNPs in stage 3, we computed the sum
of the χ² trend statistics (excluding the 6 SNPs reaching genome-wide signifi-
cance, plus rs2107425 as it was in LD with rs3817198) over those SNPs (17 of 23)
for which the estimated odds ratios in stage 3 were in the same direction as the
combined stage 1/stage 2 38. Under the null hypothesis of no association, the
asymptotic distribution of this statistic is χ² with n degrees of freedom, where
n has a binomial distribution with parameters 23 and 1/2. The significance of this
statistic was then assessed by computing a weighted sum of the tails of the
relevant χ² distributions.
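
The weighted sum of tail probabilities described above can be sketched in Python using only the standard library. The sketch takes the number of contributing SNPs n to be binomial with 23 trials and probability 1/2 (each odds ratio agreeing in direction by chance with probability one half) and weights the upper tail of each chi-square distribution by the corresponding binomial probability. The function names are hypothetical, and the closed-form tail recursion for integer degrees of freedom is a standard identity rather than anything specified in the text.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 0."""
    if df == 0:
        return 0.0 if x > 0 else 1.0  # zero d.f. means the statistic is exactly 0
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)                 # Q(1, y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))      # Q(1/2, y)
    while a < df / 2.0:
        # Recurrence Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return q

def mixture_p_value(statistic, n_snps=23, prob=0.5):
    """Binomial-weighted sum of chi-square tail probabilities."""
    total = 0.0
    for k in range(n_snps + 1):
        weight = math.comb(n_snps, k) * prob ** k * (1.0 - prob) ** (n_snps - k)
        total += weight * chi2_sf(statistic, k)
    return total
```

Conditioning on the number of same-direction SNPs in this way accounts for the fact that the SNPs entering the sum were selected by the sign of their stage 3 estimates.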

For the fine-scale mapping of the FGFR2 locus, we first derived haplotype 
frequencies using the haplo.stats package in S-plus 39 , separately for the European 
and Asian populations, using data from the case-control studies on whom the tag 
SNPs were typed plus the 164 control individuals on whom all SNPs were typed. 
These were used to impute genotype probabilities for each identified SNP in each
individual. We then used an EM algorithm to fit a logistic regression model
assuming that each SNP in turn was the causal variant, allowing for uncertainty
in the genotypes of untyped SNPs, and hence to determine the likelihood that
each SNP was the causal variant.

Coverage of the stage 1 tagging set was estimated using HapMap phase II as a 
reference. We based estimates on 2,116,183 SNPs with an m.a.f. of >5% in the 
CEU population. Of the SNPs successfully genotyped in stage 1, 187,663 were 
also on HapMap. For those SNPs not on HapMap, we identified 'surrogate' SNPs
that were in perfect LD based on genotyping of 24 Caucasians by Perlegen
Sciences (269,203 SNPs) 18. To estimate coverage, we determined the best pair-
wise r² for each HapMap SNP and each tag SNP or a surrogate SNP, using the
HapMap CEU data. This coverage was summarized in terms of the distribution
of r² by allele frequency in 10 categories.

To estimate the power to detect each of the associations found, we computed 
the non-centrality parameter for the test statistic at each stage, based on the per-
allele relative risk, allele frequency and r². This was used to estimate the power for
a given r², based on a simulated trivariate normal distribution for the score
statistics after each stage to allow for the correlations in the test statistics. We
assumed a cut-off of P < 0.05 for stage 1, P < 0.00002 for stage 2 and P < 10⁻⁷
for stage 3 (the first is slightly conservative, as more SNPs than this were actually
taken forward). The overall power was obtained by averaging the power esti-
mates for each r² over the distribution of r² obtained from the HapMap data,
applicable to a SNP of that frequency. 

The expected number of significant associations after stage 2 (Table 1) was 
calculated using a bivariate normal distribution for the joint distribution of the 
(weighted) Cochran-Armitage score statistics after stage 1 and after both stages, 
using a correlation of 0.525 between the two statistics (reflecting the weighted 
sizes of the two studies). These calculations were based on the 205,586 SNPs 
reaching the required quality control in stage 1. Of these, 11,313 reached a 
P<0.05, of which 7,405 (65.5%) were successfully genotyped to the required 
quality control in stage 2. Thus the expected number reaching a given signifi- 
cance level with good quality control was calculated from the total number 
expected to reach this level × 65.5%. We adjusted the variances of the test
statistics, separately for stages 1 and 2, using the genomic control method". 
The adjustment factor, λ, was estimated from the median of the smallest 90%
of the test statistics for SNPs typed in that stage, divided by the predicted median
for the smallest 90% of a sample of χ²₁ distributions (that is, the 45th percentile
of a χ²₁ distribution, 0.375).
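
The estimate of the adjustment factor described above can be sketched in Python. The sketch sorts the 1 d.f. test statistics, keeps the smallest 90%, and divides their median by the corresponding percentile of a chi-square distribution with 1 d.f., which it computes from the normal quantile function rather than from a hard-coded constant. The function names are hypothetical, and dividing each statistic by the estimated factor is the usual genomic control adjustment rather than a detail given in the text.

```python
from statistics import NormalDist, median

def chi2_1df_quantile(prob):
    """Quantile of a chi-square distribution with 1 d.f. (a squared normal)."""
    z = NormalDist().inv_cdf((1.0 + prob) / 2.0)
    return z * z

def genomic_control_lambda(chi2_stats, keep_fraction=0.90):
    """Median of the smallest `keep_fraction` of statistics over its expectation."""
    ordered = sorted(chi2_stats)
    n_keep = max(1, round(len(ordered) * keep_fraction))
    observed = median(ordered[:n_keep])
    # The median of the lowest 90% is the 45th percentile of the full distribution.
    expected = chi2_1df_quantile(keep_fraction / 2.0)
    return observed / expected

def adjust_statistics(chi2_stats, keep_fraction=0.90):
    """Divide each statistic by the estimated inflation factor."""
    lam = genomic_control_lambda(chi2_stats, keep_fraction)
    return [s / lam for s in chi2_stats]
```

Restricting the median to the smallest 90% of statistics keeps the estimate from being pulled upwards by the true associations, which are concentrated in the upper tail.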

36. Dean, F. B. et at. Comprehensive human genome amplification using 
multiple displacement amplification. Proc. Natl Acad. Set. USA 99, 5261-5266 
(2002). 

37. de Bakker, P. I. W. et ai Efficiency and power in genetic association studies. Nature 
Genet. 37, 1217-1223 (2005). 

38. Tyrer, J., Pharoah, P. D. P. & Easton, D. F. The admixture maximum likelihood test: 
A novel experiment-wise test of association between disease and multiple SNPs. 
Genet. Epidemiol 30, 636-643 (2006). 

39. Schaid, D. L. Rowland, C. M., Tines, D. E., Jacobson, R. M. & Poland, G. A. Score 
tests for association between traits and haplotypes when linkage phase is 
ambiguous. Am. J. Hum: Genet. 70, 425-434 (2002). 

Two variants on chromosome 17 confer prostate cancer 
risk, and the one in TCF2 protects against type 2 diabetes 

We performed a genome-wide association scan to search for 
sequence variants conferring risk of prostate cancer using 
1,501 Icelandic men with prostate cancer and 11,290 controls. 
Follow-up studies involving three additional case-control 
groups replicated an association of two variants on 
chromosome 17 with the disease. These two variants, 33 Mb 
apart, fall within a region previously implicated by family-based 
linkage studies on prostate cancer. The risks conferred 
by these variants are moderate individually (allele odds ratio 
of about 1.20), but because they are common, their joint 
population attributable risk is substantial. One of the variants is 
in TCF2 (HNF1β), a gene known to be mutated in individuals 
with maturity-onset diabetes of the young type 5. Results from 
eight case-control groups, including one West African and one 
Chinese, demonstrate that this variant confers protection 
against type 2 diabetes. 

Figure 1 A schematic view of the genome-wide association results for chromosome 17q. Shown are 
results from the genome-wide association analysis performed in the Icelandic study population. The 
results plotted are for all Illumina Hap300 chip SNPs that are located between position 30 Mb and the 
telomere (~78.6 Mb; build 35) on the long arm of chromosome 17 (blue diamonds). The six SNP 
markers circled in red and listed in Table 2 all fall within the linkage region described in ref. 8. 

Firmly established risk factors for prostate cancer are age, ethnicity 
and family history. Despite a large body of evidence for a genetic 
component to the risk of prostate cancer, sequence variants on 8q24 
are the only common variants reported so far that account for 
a substantial proportion of cases (refs. 1-4). 

In the present study, we began with a genome-wide SNP association 
study, applying 310,520 SNPs from the Illumina Hap300 chip to 
search for sequence variants conferring risk of prostate cancer using 
Icelandic cases and controls. We expanded the data from a previously 
reported study (ref. 2) by increasing the number of cases from 1,453 to 1,501 
and the number of controls from 3,064 to 11,290. This corresponds to 
a 34% increase in effective sample size. Apart from the variants on 
8q24 (refs. 1,2) and SNPs correlated with them, no other SNPs 
achieved genome-wide significance (Supplementary Fig. 1 online). 
However, we assumed that a properly designed follow-up strategy 
would lead to the identification of additional susceptibility variants for 
prostate cancer. 
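
The 34% figure can be checked with a short calculation. The sketch below assumes the usual case-control definition of effective sample size, 4 / (1/N_cases + 1/N_controls); the text does not state which definition was used, and the function name is illustrative.

```python
def effective_sample_size(n_cases, n_controls):
    """Effective size of a case-control sample: 4 / (1/n_cases + 1/n_controls).
    Assumed definition; equals the total N when cases and controls are balanced."""
    return 4.0 / (1.0 / n_cases + 1.0 / n_controls)


previous = effective_sample_size(1453, 3064)    # earlier study (ref. 2)
current = effective_sample_size(1501, 11290)    # present study
increase_pct = 100.0 * (current / previous - 1.0)
print(f"effective sample size increase: {increase_pct:.0f}%")  # about 34%
```

Under this definition most of the gain comes from the larger control set, since the number of cases barely changes.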

Like others (ref. 5), we believe that results from family-based linkage 
studies should be taken into account when evaluating the association 
results of a genome-wide study. However, instead of using linkage 
scores to formally weight the statistical significance of different SNPs (ref. 5), 
we used them to prioritize follow-up studies. The long arm of 
chromosome 17 has been reported in several linkage studies of 
prostate cancer (refs. 6-8), but no susceptibility variants have yet been 
found (refs. 9-11). Hence, we decided to focus on this region first. 

We selected for further analysis six SNPs on chromosome 17q 
having the lowest P values (<5 × 10⁻⁴) and ranking from 68 to 100 
among the most significantly associated SNPs in our genome-wide 
analysis (Fig. 1). These SNPs mapped to two distinct regions on 
chromosome 17q that are both within a region with LOD scores ranging 
from 1 to 2 but outside the proposed 10-cM candidate gene region reported 
in a recent linkage analysis (ref. 8). One locus was on 17q12 
(rs7501939 and rs3760511), encompassing the 5′ end of the TCF2 (HNF1β) 
gene, where the linkage disequilibrium (LD) is weak (based on the Utah 
CEPH (CEU) HapMap data set). The second locus is in a gene-poor area 
on 17q24.3 (rs1859962, rs7214479, rs6501455 and rs983085) where all 
four SNPs fall within a strong LD block (based on the CEU HapMap data 
set). The two loci are separated by approximately 33 Mb, and we did 
not observe any LD between them (see Supplementary Table 1 online for 
r² and D′ values). 

We genotyped five of the six SNPs in three prostate cancer case-control 
groups of European ancestry (Table 1). The assay for rs983085 on 17q24.3 
failed in genotyping, but this SNP is almost perfectly correlated with 
rs6501455 (r² = 0.99) and is therefore expected to give comparable 
results. For each of the replication study groups, the observed effects 
of four of the five SNPs were in the same direction as in Iceland. One 
SNP, rs6501455, showed an opposite effect in the Chicago group. When 
results from all four case-control groups were combined, two SNPs 
achieved genome-wide significance, 
rs7501939 allele C (rs7501939 C) at 17q12 (odds ratio (OR) = 1.19, 
P = 4.7 × 10⁻⁹) and rs1859962 allele G (rs1859962 G) at 17q24.3 
(OR = 1.20, P = 2.5 × 10⁻¹⁰) (Tables 2 and 3). In an effort to refine 
the signal at the 17q12 locus, we selected three SNPs (rs4239217, 
rs757210, rs4430796) that were substantially correlated with rs7501939 
(r² > 0.5) based on the CEU HapMap data. One of these, rs4430796, 
showed an association to prostate cancer that was stronger than that of 
rs7501939. Specifically, with all groups combined, allele A of 
rs4430796 had an OR of 1.22 with a P of 1.4 × 10⁻¹¹ (Table 2). 
A joint analysis showed that the effects of rs7501939 and rs3760511 
were no longer significant after adjusting for rs4430796 (P = 0.88 and 
0.58, respectively), whereas rs4430796 remained significant after 
adjusting for both rs7501939 and rs3760511 (P = 0.0042). At 
17q24.3, our attempt at refining the signal did not result in any 
SNP that was more significant than rs1859962. Among the Illumina 
SNPs, rs7214479 and rs6501455 were not significant (P > 0.75) with 
adjustment for the effect of rs1859962, whereas rs1859962 remained 
significant after adjusting for the other two SNPs (P = 7.4 × 10⁻[exponent illegible]). 
Henceforth, our focus was on rs4430796 at 17q12 and rs1859962 at 
17q24.3. However, at 17q12, because rs7501939 was a part of the 
original genome-wide scan, we have included it in the discussion when 
appropriate. For replication efforts, we recommend including at least 
the three abovementioned SNPs. We note that in the results released 
by the Cancer Genetic Markers of Susceptibility study group (see URL 
below), these three SNPs also show nominal, but not genome-wide, 
significant association with prostate cancer. 

For men with prostate cancer diagnosed at age 65 or younger, 
the observed OR from the combined analysis was slightly higher (1.30 



Table 1 Characteristics of men with prostate cancer and controls from four sources 

Study population            Affected      Controls   Aggressiveᵃ   Mean age at          Age at diagnosis 
                            individuals              (%)           diagnosis (range)    <65 years (%) 
Iceland                     1,501         11,290     50            70.8 (40-96)         22 
Nijmegen, The Netherlands   999           1,466      47            64.2 (43-83)         52 
Zaragoza, Spain             456           1,078      37            69.3 (44-83)         19 
Chicago                     537           514        48            59.6 (39-87)         70 
Total:                      3,493         14,348 

ᵃ 'Aggressive' is defined here as cancers with Gleason scores of 7 or higher and/or a stage of T3 or higher and/or node-positive 
disease and/or metastatic [text truncated in source] 

Table 2 Association results for SNPs on 17q12 and prostate cancer in Iceland, The Netherlands, Spain and the US 

Study population (N cases/N controls)      Frequency 
and variant (allele)                       Cases    Controls   OR (95% c.i.)       P value 

Iceland (1,501/11,289) 
  rs7501939 (C)                            0.615    0.578      1.17 (1.08-1.27)    1.8 × 10⁻⁴ 
  rs3760511 (C)                            0.384    0.348      1.17 (1.08-1.27)    1.6 × 10⁻⁴ 
  rs4430796 (A)                            0.558    0.512      1.20 (1.11-1.31)    1.4 × 10⁻⁵ 
The Netherlands (997/1,464) 
  rs7501939 (C)                            0.648    0.589      1.29 (1.15-1.45)    2.4 × 10⁻⁵ 
  rs3760511 (C)                            0.362    0.338      1.11 (0.99-1.25)    0.086 
  rs4430796 (A)                            0.568    0.508      1.28 (1.14-1.43)    3.1 × 10⁻⁵ 
Spain (456/1,078) 
  rs7501939 (C)                            0.583    0.566      1.07 (0.92-1.26)    0.37 
  rs3760511 (C)                            0.277    0.257      1.11 (0.93-1.32)    0.25 
  rs4430796 (A)                            0.469    0.454      1.06 (0.91-1.24)    0.45 
Chicago (536/514) 
  rs7501939 (C)                            0.637    0.588      1.15 (1.03-1.47)    0.021 
  rs3760511 (C)                            0.347    0.294      1.28 (1.06-1.54)    9.4 × 10⁻³ 
  rs4430796 (A)                            0.563    0.477      1.41 (1.19-1.67)    9.4 × 10⁻⁵ 
All excluding Iceland (1,989/3,056)ᵃ 
  rs7501939 (C)                                     0.581      1.21 (1.12-1.32)    5.6 × 10⁻⁶ 
  rs3760511 (C)                                     0.296      1.15 (1.05-1.25)    2.4 × 10⁻³ 
  rs4430796 (A)                                     0.480      1.24 (1.14-1.35)    2.0 × 10⁻⁷ 
All combined (3,490/14,345)ᵃ 
  rs7501939 (C)                                     0.580      1.19 (1.12-1.26)    4.7 × 10⁻⁹ 
  rs3760511 (C)                                     0.309      1.16 (1.09-1.23)    1.4 × 10⁻⁶ 
  rs4430796 (A)                                     0.488      1.22 (1.15-1.30)    1.4 × 10⁻¹¹ 

All P values shown are two sided. Shown are the numbers of cases and controls (N), allelic frequencies of variants in affected and control individuals, the allelic odds ratio (OR) with 
95% confidence interval (95% c.i.) and P values based on the multiplicative model. 
ᵃ For the combined study populations, the reported control frequency was the average, unweighted control frequency of the individual populations, whereas the OR and the P values were estimated 
using the Mantel-Haenszel model. 

for rs4430796 A and 1.27 for rs1859962 G). For each copy of the 
at-risk alleles, carriers were diagnosed with prostate cancer 2 months 
younger for rs4430796 and 5 months younger for rs1859962, 
compared with noncarriers with prostate cancer. However, this 
observation was not statistically significant and therefore requires 
further investigation. 

We did not observe any interaction between the risk variants on 
17q12 and 17q24.3; a multiplicative or log-additive model provided an 
adequate fit for the joint risk of rs4430796 and rs1859962. We 
estimated genotype-specific ORs for each locus individually 
(Table 4). Based on results from all four groups, a multiplicative 
model for the genotype risk provided an adequate fit for rs4430796 at 
17q12. However, for rs1859962 at the 17q24.3 locus, the full model 
provided a significantly better fit than the multiplicative model 
(P = 0.006), a result driven mainly by the Icelandic samples. 
Specifically, the estimated OR of 1.33 for a heterozygous carrier of 
rs1859962 G was substantially higher than the 1.20 estimate implied 
by a multiplicative model. 
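
To make the comparison concrete, the sketch below computes an allelic OR directly from case and control allele frequencies and the genotype ORs that a multiplicative model implies. It is an illustration only: the function names are hypothetical, and the reported estimates come from a likelihood procedure, so this simple frequency ratio is a cross-check rather than the authors' method.

```python
def allelic_odds_ratio(freq_cases, freq_controls):
    """Odds of carrying the allele in cases divided by the odds in controls."""
    odds_cases = freq_cases / (1.0 - freq_cases)
    odds_controls = freq_controls / (1.0 - freq_controls)
    return odds_cases / odds_controls


def multiplicative_genotype_ors(allelic_or):
    """Under a multiplicative model each extra copy multiplies the odds
    by the allelic OR, so 0X carries OR and XX carries OR squared."""
    return {"0X": allelic_or, "XX": allelic_or ** 2}


# Iceland, rs1859962 G: allele frequency 0.489 in cases and 0.453 in controls (Table 3).
print(round(allelic_odds_ratio(0.489, 0.453), 2))

# Combined groups, rs1859962 G: allelic OR 1.20 versus model-free estimates (Table 4).
implied = multiplicative_genotype_ors(1.20)
observed = {"0X": 1.33, "XX": 1.45}
for genotype in ("0X", "XX"):
    print(genotype, round(implied[genotype], 2), observed[genotype])
```

The heterozygote estimate is where the two sets of values diverge, which is the departure from the multiplicative model that the text describes.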

The SNPs rs7501939 and rs4430796 on 17q12 are located in the first 
and second intron of the TCF2 gene, respectively. To the best of our 
knowledge, sequence variants in TCF2 have not been previously 
implicated in the risk of prostate cancer. More than 50 different 
exonic TCF2 mutations have been reported in individuals with renal 
cysts, maturity-onset diabetes of the young type 5 (MODY5), pancreatic 
atrophy and genital tract abnormalities (refs. 12,13). We sequenced all 
nine exons of TCF2 in 200 Icelandic men with prostate cancer and 200 
Icelandic controls without detecting any mutations explaining our 
association signal (data not shown). 



Notably, several epidemiological studies have demonstrated an 
inverse relationship between type 2 diabetes (T2D) and the risk of 
prostate cancer (see ref. 14 and references therein). A recent 
meta-analysis estimated the relative risk of prostate cancer to be 0.84 (95% 
confidence interval (c.i.), 0.71-0.92) among diabetes patients (ref. 14). 
Therefore, we decided to investigate a potential association between T2D 
and the SNPs in TCF2 showing the strongest association with prostate 
cancer in our data. 

We typed the Illumina SNP rs7501939 in 1,380 individuals with 
T2D (males in this group were not known to have prostate cancer, 
according to the Icelandic Cancer Registry list of individuals with 
prostate cancer diagnosed from 1955 to 2006). When compared with 
9,940 controls not known to have either prostate cancer or T2D, 
rs7501939 C showed a protective effect against T2D (OR = 0.88, 
P = 0.0045) in these samples. For the same samples, allele A of the 
refinement SNP rs4430796 gave a comparable result (OR = 0.86, 
P = 0.0021). To validate this association, we typed both rs7501939 and 
rs4430796 in seven additional T2D case-control groups of European, 
African and Asian ancestry (Supplementary Note online). In all seven 
case-control groups, rs7501939 C and rs4430796 A showed a protective 
effect against the disease (that is, an OR < 1.0). Combining results 
from all eight T2D case-control groups, including the Icelandic group, 
gave an OR of 0.91 (P = 9.2 × 10⁻⁷) for rs7501939 C and an OR of 
0.91 (P = 2.7 × 10⁻⁷) for rs4430796 A (Table 5). In a joint analysis, 
the effect of rs4430796 remained significant with adjustment for 
rs7501939 (P = 0.016), whereas rs7501939 did not after adjusting 
for rs4430796 (P = 0.41). We note that the former was mainly driven 
by the data from West Africa, where the correlation between the two 



NATURE GENETICS VOLUME 39 | NUMBER 8 | AUGUST 2007 




Table 3 Association results for SNPs on 17q24.3 and prostate cancer in Iceland, The Netherlands, Spain and the US 

Study population (N cases/N controls)      Frequency 
and variant (allele)                       Cases    Controls   OR (95% c.i.)       P value 

Iceland (1,501/11,290) 
  rs1859962 (G)                            0.489    0.453      1.16 (1.07-1.26)    3.1 × 10⁻⁴ 
  rs7214479 (T)                            0.451    0.415      1.16 (1.07-1.26)    3.3 × 10⁻⁴ 
  rs6501455 (A)                            0.538    0.501      1.16 (1.07-1.26)    3.0 × 10⁻⁴ 
  rs983085 (C)ᵃ                            0.542    0.504      1.16 (1.07-1.26)    2.0 × 10⁻⁴ 
The Netherlands (999/1,466) 
  rs1859962 (G)                            0.522    0.456      1.30 (1.16-1.46)    6.8 × 10⁻⁶ 
  rs7214479 (T)                            0.474    0.428      1.20 (1.07-1.35)    1.5 × 10⁻³ 
  rs6501455 (A)                            0.544    0.488      1.25 (1.12-1.40)    1.1 × 10⁻⁴ 
Spain (456/1,078) 
  rs1859962 (G)                            0.512    0.476      1.15 (0.99-1.35)    0.071 
  rs7214479 (T)                            0.455    0.426      1.13 (0.96-1.32)    0.14 
  rs6501455 (A)                            0.581    0.552      1.13 (0.97-1.32)    0.13 
Chicago (537/510) 
  rs1859962 (G)                            0.513    0.456      1.25 (1.06-1.49)    9.8 × 10⁻³ 
  rs7214479 (T)                            0.460    0.416      1.20 (1.01-1.42)    0.041 
  rs6501455 (A)                            0.549    0.586      0.86 (0.72-1.02)    0.083 
All excluding Iceland (1,992/3,054)ᵇ 
  rs1859962 (G)                                     0.463      1.25 (1.15-1.35)    8.3 × 10⁻⁸ 
  rs7214479 (T)                                     0.423      1.18 (1.09-1.28)    7.0 × 10⁻⁵ 
  rs6501455 (A)                                     0.542      1.12 (1.05-1.20)    6.2 × 10⁻³ 
All combined (3,493/14,344)ᵇ 
  rs1859962 (G)                                     0.460      1.20 (1.14-1.27)    2.5 × 10⁻¹⁰ 
  rs7214479 (T)                                     0.421      1.17 (1.10-1.24)    8.1 × 10⁻⁸ 
  rs6501455 (A)                                     0.532      1.14 (1.08-1.21)    6.9 × 10⁻⁶ 

All P values shown are two sided. Shown are the numbers of cases and controls (N), allelic frequencies of variants in affected and control individuals, the allelic odds ratio (OR) with 
95% confidence interval (95% c.i.) and P values based on the multiplicative model. 
ᵃ SNPs rs983085 and rs6501455 were almost perfectly correlated (r² = 0.99), but rs983085 failed in genotyping in the non-Icelandic groups. ᵇ For the combined study populations, the reported 
control frequency was the average, unweighted control frequency of the individual populations, whereas the OR and the P values were estimated using the Mantel-Haenszel model. 



SNPs is substantially lower than in individuals of European ancestry 
(r² = 0.22 and r² = 0.77 in the Yoruba and CEU HapMap samples, 
respectively). For T2D, a recent report (ref. 15) describes similar findings 
(OR = 0.89, P = 5 × 10⁻⁶) for allele G of the SNP rs757210, which is 
substantially correlated with rs4430796 A (D′ = 0.96; r² = 0.62; based 
on the CEU HapMap data set). This reinforces the finding that one or 
more variants in TCF2 that confer risk of prostate cancer are 
protective against T2D. Notably, removing individuals with T2D 
from the Icelandic case-control group had minimal impact on the 
association of rs4430796 with prostate cancer (Supplementary Note). 

The more distal SNP, rs1859962, on chromosome 17q24.3 is in a 
177.5-kb LD block spanning positions 66.579 Mb to 66.757 Mb 



Table 4 Model-free estimates of the genotype OR of rs4430796 (A) at 17q12 and rs1859962 (G) at 17q24.3 

Study group and                        Genotype ORᵃ 
variant (allele)       Allelic OR   00   0X (95% c.i.)       XX (95% c.i.)       P valueᵇ      P valueᶜ       PAR 

Iceland 
  rs4430796 (A)        1.20         1    1.12 (0.97-1.29)    1.40 (1.19-1.64)    0.31          8.3 × 10⁻⁵     0.14 
  rs1859962 (G)        1.16         1    1.35 (1.18-1.54)    1.33 (1.13-1.57)    3.4 × 10⁻³    2.3 × 10⁻⁵     0.19 
All except Iceland 
  rs4430796 (A)        1.24         1    1.34 (1.18-1.52)    1.56 (1.32-1.84)    0.12          4.5 × 10⁻⁷     0.23 
  rs1859962 (G)        1.25         1    1.32 (1.17-1.49)    1.57 (1.33-1.84)    0.24          2.9 × 10⁻⁷     0.22 
All combined 
  rs4430796 (A)        1.22         1    1.24 (1.13-1.36)    1.48 (1.32-1.66)    0.57          2.0 × 10⁻¹⁰    0.19 
  rs1859962 (G)        1.20         1    1.33 (1.21-1.44)    1.45 (1.29-1.62)    6.0 × 10⁻³    5.1 × 10⁻¹¹    0.21 

PAR, population attributable risk; OR, odds ratio; 95% c.i., 95% confidence interval. 
ᵃ Genotype odds ratios for heterozygous (0X) and homozygous carriers (XX) compared with non-carriers (00). ᵇ Test of the multiplicative 
model (the null hypothesis) versus the full model (one degree of freedom). ᶜ Test of no effect (the null hypothesis) versus the full model 
(two degrees of freedom). 

Table 5 Association results for SNPs in the TCF2 gene on 17q12 and type 2 diabetes 

Study population (N cases/N controls)      Frequency 
and variant (allele)                       Cases    Controls   OR (95% c.i.)       P value 

Icelandᵃ (1,380/9,940) 
  rs7501939 (C)                            0.549    0.582      0.88 (0.80-0.96)    0.0045 
  rs4430796 (A)                            0.482    0.521      0.86 (0.78-0.95)    0.0021 
Denmark A (264/596) 
  rs7501939 (C)                            0.525    0.593      0.76 (0.62-0.93)    0.0088 
  rs4430796 (A)                            0.452    0.530      0.73 (0.60-0.90)    0.0032 
Denmark B (1,365/4,843) 
  rs7501939 (C)                            0.579    0.596      0.93 (0.85-1.02)    0.11 
  rs4430796 (A)                            0.507    0.528      0.92 (0.85-1.00)    0.062 
Philadelphia (457/967) 
  rs7501939 (C)                            0.569    0.613      0.83 (0.71-0.98)    0.028 
  rs4430796 (A)                            0.477    0.527      0.82 (0.70-0.96)    0.013 
Scotland (3,741/3,718) 
  rs7501939 (C)                            0.607    0.615      0.97 (0.91-1.03)    0.31 
  rs4430796 (A)                            0.517    0.526      0.97 (0.91-1.03)    0.29 
The Netherlands (367/915) 
  rs7501939 (C)                            0.563    0.579      0.94 (0.79-1.11)    0.46 
  rs4430796 (A)                            0.494    0.506      0.95 (0.79-1.14)    0.58 
Hong Kong (1,495/993) 
  rs7501939 (C)                            0.768    0.791      0.87 (0.76-1.00)    0.054 
  rs4430796 (A)                            0.731    0.754      0.89 (0.78-1.01)    0.073 
West Africaᵇ (867/1,115) 
  rs7501939 (C)                            0.400    0.437      0.87 (0.77-0.99)    [not legible in source] 
  rs4430796 (A)                            0.271    0.313      0.80 (0.69-0.92)    0.0024 
All groups excluding Iceland 
  rs7501939 (C)                                                0.91 (0.87-0.95)    3.4 × 10⁻⁵ 
  rs4430796 (A)                                                0.92 (0.88-0.95)    1.8 × 10⁻⁵ 
All groups combined (9,936/23,087) 
  rs7501939 (C)                                                0.91 (0.87-0.94)    9.2 × 10⁻⁷ 
  rs4430796 (A)                                                0.91 (0.87-0.94)    2.7 × 10⁻⁷ 

All P values shown are two sided. Shown are the numbers of cases and controls (N), allelic frequencies of variants in affected and control individuals, the allelic odds ratio (OR) with 
95% confidence interval (95% c.i.) and P values based on the multiplicative model. 
ᵃ Men known to have prostate cancer were excluded from the Icelandic T2D group (both affected individuals and controls). ᵇ Results for the five West African tribes have been combined using a 
Mantel-Haenszel method. The frequency of the variant in West African affected individuals and controls is the weighted average over the five tribes. 



(National Center for Biotechnology Information (NCBI) build 35), 
based on the CEU HapMap group. The closest telomeric gene is SOX9, 
located ~900 kb away from the LD block. One mRNA (BC039327) 
and several unspliced ESTs have been localized to this region, but it 
does not contain any known genes (University of California Santa 
Cruz Genome Browser, May 2004 assembly). RT-PCR analysis of 
various cDNA libraries, including those derived from the prostate, 
detected expression of the BC039327 mRNA only in a testis library 
(data not shown), in line with previously reported results (ref. 16). 

In summary, we have found that two common variants on 
chromosome 17q, rs4430796 A and rs1859962 G, contribute to the 
risk of prostate cancer in four populations of European descent. 
Together, based on the combined results, these two variants have an 
estimated joint population attributable risk (PAR) of ~36%, which is 
substantial from a public health viewpoint. The large PAR is a 
consequence of the high frequencies of these variants. However, as 
their relative risks, as estimated by the ORs, are not high, the sibling 
risk ratio (ref. 17) that they account for is only ~1.009 for each variant 
separately and ~1.018 jointly. As a consequence, they can explain only 
a small fraction of the familial clustering of the disease and can 
therefore generate only modest linkage scores. We were most intrigued 
that the variant in TCF2 is associated with increased risk of prostate 
cancer but reduced risk of T2D in individuals of European, African 
and Asian descent. The discovery of a sequence variant in the TCF2 
gene that accounts for at least part of the inverse relationship between 
these two diseases provides a step toward understanding the complex 
biochemical checks and balances that result from the pleiotropic 
impact of singular genetic variants. Previous explanations of the 
well-established inverse relationship between prostate cancer and 
T2D have centered on the impact of the metabolic and hormonal 
environment of diabetic men. However, we note that the protective 
effect of both the TCF2 SNPs against T2D is too modest for its impact 
on prostate cancer risk to be merely a by-product of its impact on 
T2D. Indeed, we favor the notion that the primary functional impact 
of rs4430796 (or a presently unknown correlated variant) is on one or 
more metabolic or hormonal pathways important for the normal 
functioning of individuals throughout their lives that incidentally 
modulate the risk of developing prostate cancer and T2D late in life. 
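
The PAR figures quoted above can be reproduced approximately with a standard calculation. The sketch below assumes Hardy-Weinberg genotype frequencies in the population, uses the odds ratios as stand-ins for relative risks, and combines the two loci as if their effects were independent and multiplicative. The text does not spell out the authors' exact formula, so this is an assumed reconstruction with hypothetical function names, although it agrees with the combined-group PAR values in Table 4 to two decimals.

```python
def hwe_genotype_frequencies(allele_freq):
    """Genotype frequencies under Hardy-Weinberg equilibrium."""
    p, q = allele_freq, 1.0 - allele_freq
    return {"00": q * q, "0X": 2.0 * p * q, "XX": p * p}


def population_attributable_risk(allele_freq, or_het, or_hom):
    """PAR = 1 - 1 / (population-average risk relative to non-carriers)."""
    f = hwe_genotype_frequencies(allele_freq)
    mean_relative_risk = f["00"] + f["0X"] * or_het + f["XX"] * or_hom
    return 1.0 - 1.0 / mean_relative_risk


def joint_par(pars):
    """Combine single-locus PARs assuming independent, multiplicative effects."""
    remaining = 1.0
    for par in pars:
        remaining *= 1.0 - par
    return 1.0 - remaining


# Combined-group control frequencies (Tables 2 and 3) and genotype ORs (Table 4).
par_17q12 = population_attributable_risk(0.488, or_het=1.24, or_hom=1.48)
par_17q24 = population_attributable_risk(0.460, or_het=1.33, or_hom=1.45)
print(round(par_17q12, 2), round(par_17q24, 2), round(joint_par([par_17q12, par_17q24]), 2))
```

Because both alleles are common, modest odds ratios still translate into sizeable attributable fractions, which is the point made in the text.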

METHODS 

Icelandic study population. Men diagnosed with prostate cancer were identified 
based on a nationwide list from the Icelandic Cancer Registry (ICR) that 
contained all 3,886 Icelandic prostate cancer patients diagnosed from January 1, 
1955, to December 31, 2005. The Icelandic prostate cancer sample collection 
included 1,615 patients (diagnosed from December 1974 to December 2005) 
who were recruited from November 2000 until June 2006 out of the 1,968 
affected individuals who were alive during the study period (a participation 
rate of about 82%). A total of 1,541 affected individuals were included in a 
genome-wide SNP genotyping effort, using the Infinium II assay method and 
the Illumina Sentrix HumanHap300 BeadChip. Of these, 1,501 (97%) were 
successfully genotyped according to our quality control criteria (Supplementary 
Methods online) and were used in the present case-control association 
analysis. The mean age at diagnosis for the consenting patients was 71 years 
(median 71 years; range, 40-96 years), and the mean age at diagnosis was 
73 years for all individuals with prostate cancer in the ICR. The median time 
from diagnosis to blood sampling was 2 years (range, 0-26 years) (see ref. 1 for 
a more detailed description of the Icelandic prostate cancer study population). 
No significant difference was seen in frequencies of rs7501939 (C), rs4430796 
(A) or rs1859962 (G) between men diagnosed before 1998 and those diagnosed 
in 1998 or later (P = 0.74, P = 0.87 and P = 0.35, respectively). More 
specifically, using only cases diagnosed in 1998 or later (N = 880) versus all our 
controls (N = 11,289), we obtained OR values of 1.16 (P = 0.004), 1.20 (P = 5.5 × 
10⁻⁴) and 1.20 (P = 5 × 10⁻[exponent illegible]) for rs7501939 (C), rs4430796 (A) and rs1859962 (G), 
respectively. The 11,290 controls (5,010 males and 6,280 females) used in this 
study consisted of 758 controls randomly selected from the Icelandic genealogical 
database and 10,532 individuals from other ongoing genome-wide 
association studies at deCODE (specifically, ~1,400 from studies on T2D, 
~1,600 from studies on breast cancer and 1,800 from studies on myocardial 
infarction; studies on colon cancer, anxiety, addiction, schizophrenia and 
infectious diseases provided ~700-1,000 controls each). The controls had a 
mean age of 66 years (median, 67 years; range, 22-102 years). The 
male controls were absent from the ICR's nationwide list of prostate 
cancer patients. 

The study was approved by the Data Protection Commission of Iceland and 
the National Bioethics Committee of Iceland. Written informed consent was 
obtained from all patients, relatives and controls. Personal identifiers associated 
with medical information and blood samples were encrypted with a third-party 
encryption system as previously described (ref. 18). 


Study populations from The Netherlands, Spain and the US. The total 
number of men with prostate cancer from the Netherlands in this study was 
1,013, of whom 999 (98%) were successfully genotyped. This study population 
comprised two recruitment sets of men with prostate cancer: Group A, 
comprising 390 hospital-based affected individuals recruited from January 
1999 to June 2006 at the Urology Outpatient Clinic of the Radboud University 
Nijmegen Medical Centre (RUNMC), and Group B, consisting of 623 affected 
individuals recruited from June 2006 to December 2006 through a population-based 
cancer registry held by the Comprehensive Cancer Centre East. Both 
groups were of self-reported European descent. The average age at diagnosis for 
patients in Group A was 63 years (median, 63 years; range, 43-83 years). The 
average age at diagnosis for patients in Group B was 65 years (median 66 years; 
range, 43-75 years). 

The 1,466 control individuals from The Netherlands were cancer free and 
were matched for age with the cases. They were recruited as part of the 
Nijmegen Biomedical Study, a population-based survey conducted by the 
Department of Epidemiology and Biostatistics and the Department of Clinical 
Chemistry of the RUNMC in which 9,371 individuals participated from a total 
of 22,500 age- and sex-stratified randomly selected inhabitants of Nijmegen, 
The Netherlands. Control individuals from the Nijmegen Biomedical Study 
were invited to participate in a study on gene-environment interactions in 
multifactorial diseases such as cancer. All the 1,466 participants in the present 
study are of self-reported European descent and were fully informed about the 
goals and the procedures of the study. The study protocol was approved by the 
Institutional Review Board of Radboud University, and all study subjects gave 
written informed consent. 

The Spanish study population consisted of 464 men with prostate cancer, of 
whom 456 (98%) were successfully genotyped. The cases were recruited from 
the Oncology Department of Zaragoza Hospital in Zaragoza, Spain, from June 
2005 to September 2006. All were of self-reported European descent. Clinical 
information, including age at onset, grade and stage, was obtained from 
medical records. The average age at diagnosis for the patients was 69 years 
(median, 70 years; range, 44-83 years). The 1,078 Spanish control individuals 
were approached at Zaragoza University Hospital and were confirmed to be 
prostate cancer free before they were included in the study. Study protocols 
were approved by the Institutional Review Board of Zaragoza University 
Hospital. All subjects gave written informed consent. 

The Chicago study population consisted of 557 men with prostate cancer, of 
whom 537 (96%) were successfully genotyped. The affected individuals were 
recruited from the Pathology Core of Northwestern University's Prostate 
Cancer Specialized Program of Research Excellence (SPORE) from May 2002 
to September 2006. The average age at diagnosis for the affected individuals was 
60 years (median, 59 years; range, 39-87 years). The 514 European American 
controls were recruited as healthy control subjects for genetic studies at the 
University of Chicago and Northwestern University Medical School. Study 
protocols were approved by the Institutional Review Boards of Northwestern 
University and the University of Chicago. All subjects gave written 
informed consent. 

For description of the diabetes case-control groups, see the Supple- 
mentary Note. 

Association analysis. All Icelandic case and control samples were assayed with 
the Illumina Infinium HumanHap300 SNP chip. This chip contains 317,503 
SNPs and provides about 75% genomic coverage in the Utah CEPH (CEU) 
HapMap samples for common SNPs at r² ≥ 0.8. For the association analysis, 
310,520 SNPs were used; 6,983 SNPs were deemed unusable owing to reasons 
such as monomorphism, low yield (<95%) and failure of Hardy-Weinberg 
equilibrium (HWE) (Supplementary Methods). Samples with a call rate 
<98% were excluded from the analysis. Single-SNP genotyping for the five 
SNPs reported here and the four case-control groups was carried out 
by deCODE Genetics, applying the Centaurus (ref. 19) (Nanogen) platform to 
all populations studied (Supplementary Methods and Supplementary 
Table 2a online). For the five SNPs genotyped by both methods in 
1,501 affected individuals and 758 controls from Iceland, the concordance rate 
for genotypes was >99.5% between the Illumina platform and the 
Centaurus platform. 

For SNPs that were in strong LD, whenever the genotype of one SNP was 
missing for an individual, the genotype of the correlated SNP was used to 
provide partial information through a likelihood approach, as we have done 
before 1 . This ensured that results presented in Tables 2-5 were always based on 
the same number of individuals, allowing meaningful comparisons of results 
for highly correlated SNPs. A likelihood procedure described in a previous 
publication 20 and implemented in NEMO software was used for the association 
analyses. We attempted to genotype all individuals and all SNPs reported in 
Tables 2-5. For each SNP, the yield was >95% in every group. The only 
exception was in the case of refinement marker rs4430796, which was not a part 
of the HumanHap 300 chip. For this SNP, using a single SNP assay to genotype, 
we attempted to genotype 1,883 of the 11,290 Icelandic controls (genotyping 
was successful for 99% of them (1,860 individuals)) as well as all affected 
Icelandic individuals and all individuals from the replication study groups. 
Most notably, for the 17ql2 locus, when we evaluated the significance of one 
SNP (for example, rs4430796, rs7501939 or rs3760511) with adjustment for 
one or two other SNPs, whether we used all 11,289 Icelandic controls that had 
genotypes for at least one of the three markers in Table 2 and handled 
the missing data by applying a likelihood approach as mentioned above or 
whether we applied logistic regression only to individuals that had genotypes 
for all three markers, the resulting P values are very similar. We tested the 
association of an allele with prostate cancer using a standard likelihood ratio 
statistic that, if the subjects were unrelated, would have asymptotically a y} 
distribution with one degree of freedom under the null hypothesis. Allelic 
frequencies rather than carrier frequencies are presented for the markers in the 
main text, but genotype counts are provided in Supplementary Table 3 online. 
Allele-specific ORs and associated P values were calculated assuming a
multiplicative model for the two chromosomes of an individual (ref. 21). For each of the
four case-control groups, there was no significant deviation from HWE in the 
controls (P > 0.01). When estimating genotype-specific OR (Table 3), we 
estimated genotype frequencies in the population assuming HWE. We feel that 
this estimate is more stable than an estimate calculated using the observed 
genotype counts in controls directly. However, we note that these two
approaches gave very similar estimates in this instance. Results from multiple
case-control groups were combined using a Mantel-Haenszel model (ref. 22) in which
the groups were allowed to have different population frequencies for alleles, 
haplotypes and genotypes but were assumed to have common relative risks. All 
four of the European sample groups include both male and female controls. We 
did not detect a significant difference between male and female controls for 
SNPs in Tables 2-4 for each of the groups after correction for the number of 
tests performed. We note that for all the three significant variants (rs7501939, 
rs4430796 and rs1859962) reported in Tables 2 and 3, we did not detect any
significant differences in frequencies among the different groups of affected 
individuals (see description of Icelandic control samples) that make up the 
Icelandic genome-wide control sets (P = 0.30, 0.55 and 0.88, respectively). The 
individuals with T2D were removed when this test was performed for 
rs7501939 and rs4430796. Our analysis of the data does not indicate any
differential association by gender of rs7501939 or rs4430796 to T2D. We used 
linear regression to estimate the relationship between age at onset for prostate 
cancer and number of copies of at-risk alleles (for rs7501939 and rs1859962)
carried by affected individuals, using group as an indicator. 
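The pooling step described above can be illustrated with a minimal Python sketch of the standard Mantel-Haenszel pooled odds ratio. This is not the NEMO implementation used in the study; the function name and the allele counts are hypothetical, and each stratum is assumed to be a tuple of (case risk-allele count, case other-allele count, control risk-allele count, control other-allele count).

```python
from typing import Iterable, Tuple

# (case risk alleles, case other alleles, control risk alleles, control other alleles)
Stratum = Tuple[float, float, float, float]


def mantel_haenszel_or(strata: Iterable[Stratum]) -> float:
    """Pooled odds ratio: sum(a*d/n) / sum(b*c/n) over strata (a, b, c, d)."""
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n <= 0:
            raise ValueError("each stratum needs a positive total count")
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("pooled odds ratio is undefined when sum(b*c/n) is zero")
    return numerator / denominator


# Hypothetical allele counts for two case-control groups (not study data).
groups = [(120, 280, 200, 800), (60, 140, 90, 410)]
pooled = mantel_haenszel_or(groups)
```

Each group keeps its own allele frequencies, while a single common odds ratio is estimated across groups, which mirrors the assumption of common relative risks stated in the text.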

To investigate potential interaction between rs7501939 C and rs1859962 G
located at 17q12 and 17q24.3, respectively, we performed two analyses. First, we
checked for the absence of significant correlation between those alleles among
cases. Second, using logistic regression, we demonstrated that the interaction
term was not significant (P = 0.57). The joint PAR was calculated as
1 - ((1 - PAR1) x (1 - PAR2)), where PAR1 and PAR2 are the individual PARs
for each SNP calculated under the full model and assuming no interaction
between the SNPs.
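As a hedged illustration of the joint PAR formula, the following Python sketch combines two individual PARs under the no-interaction assumption. The function name and the input values are hypothetical and are not taken from the study.

```python
def joint_par(par1: float, par2: float) -> float:
    """Joint PAR for two non-interacting variants: 1 - (1 - PAR1) * (1 - PAR2)."""
    for par in (par1, par2):
        if not 0.0 <= par <= 1.0:
            raise ValueError("each PAR must lie in [0, 1]")
    return 1.0 - (1.0 - par1) * (1.0 - par2)


# Hypothetical inputs chosen only to show the arithmetic.
example_joint_par = joint_par(0.25, 0.5)
```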

We note that for the SNP rs757210, others have reported the results for allele
A (ref. 15). However, in the main text, we provide their corresponding results for the
other allele (allele G of rs757210) because that allele was the one positively
correlated with our reported allele C of rs7501939.

Correction for relatedness and genomic control. Some individuals in the
Icelandic case-control groups were related to each other, causing the
aforementioned χ² test statistic to have a mean >1. We estimated the inflation
factor by calculating the mean of the 310,520 χ² statistics, which is 1.098. Using
a method of genomic control (ref. 23) to adjust for both relatedness and potential
population stratification, results presented here are based on adjusting the
χ² statistics by dividing each of them by 1.098. Supplementary Figure 1 is a
Q-Q plot of the observed χ² statistics, before and after adjustment, against the
χ² distribution with one degree of freedom.
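The adjustment described above can be sketched in a few lines of Python. This is a minimal sketch rather than the authors' code: the function names are hypothetical, the inflation factor is taken as the mean of the statistics as the text describes (other genomic-control variants use the median), and the P value uses the closed form for a chi-square statistic with one degree of freedom.

```python
import math
from typing import List, Optional, Sequence, Tuple


def genomic_control_adjust(
    chi2_stats: Sequence[float], inflation: Optional[float] = None
) -> Tuple[List[float], float]:
    """Divide each 1-d.f. chi-square statistic by an inflation factor.

    If no factor is supplied, the mean of the statistics is used, since a
    chi-square variable with one degree of freedom has expectation 1.
    """
    stats = list(chi2_stats)
    if not stats:
        raise ValueError("at least one statistic is required")
    factor = sum(stats) / len(stats) if inflation is None else inflation
    if factor <= 0:
        raise ValueError("inflation factor must be positive")
    return [s / factor for s in stats], factor


def chi2_1df_pvalue(stat: float) -> float:
    """Upper-tail P value for a chi-square statistic with one degree of freedom."""
    if stat < 0:
        raise ValueError("chi-square statistics are non-negative")
    return math.erfc(math.sqrt(stat / 2.0))


# Illustrative statistics (not study data), adjusted with the factor from the text.
adjusted, factor = genomic_control_adjust([1.098, 2.196, 10.98], inflation=1.098)
```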

URLs. Cancer Genetic Markers of Susceptibility Project: http://cgems.cancer.gov/.
University of California Santa Cruz Genome Browser: http://www.genome.ucsc.edu.

Requests for materials: kstefans@decode.is or julius.gudmundsson@decode.is

Haplotype relative risks: an easy reliable way to construct a proper
control sample for risk calculations 

C. T. FALK and P. RUBINSTEIN 

The Lindsley F. Kimball Research Institute of The New York Blood Center, 310 E. 67th St., 

New York, NY 10021 

SUMMARY 

An alternative to Woolf's (1955) relative risk (RR) statistic is proposed for use in calculating
the risk of disease in the presence of particular antigens or phenotypes. This alternative uses, 
as the control sample, the parental antigens or haplotypes not present in the affected child. The 
formulation of a haplotype relative risk (HRR) thus eliminates the problems of sampling from 
the same homogeneous population to form both the disease sample and an appropriate control. 

We show that, in families selected through a single affected individual, where transmission 
of the four parental haplotypes can be followed unambiguously, the mathematical expectation 
of the HRR is identical to that of the RR. Since the sample formed from the 'non-affected' 
parental haplotypes is clearly from the same population as the disease sample, the HRR thus 
provides a reliable alternative to the RR. A further advantage obtains when family data are
being collected as part of a study since the control sample is then automatically contained in 
the family material. 

Data from studies of patients with insulin dependent diabetes mellitus (IDDM) are used to 
obtain an estimate of the risk to those with HLA antigens or phenotypes associated with IDDM 
using the HRR statistic. A comparison of the HRR's and RR's for these data is also presented. 

INTRODUCTION 

Relative risks have been used for some time to estimate the increased risk of contracting a 
disease, given that a certain condition (or trait) is present, over that of the group lacking the 
condition. This formal definition of a relative risk requires prospective information that is not 
easily obtained and the relative risk is often approximated by the more easily obtained cross
product

\[ \frac{\Pr(Q \mid \text{aff})\,\Pr(q \mid \text{control})}{\Pr(q \mid \text{aff})\,\Pr(Q \mid \text{control})}, \]

where Q stands for the presence of the condition or trait and q for the lack of the condition, 
and the four terms are conditional probabilities as indicated. When the overall frequency of 
the disease in a population is low, this estimate will closely approximate the true relative risk. 
This odds ratio was proposed by Woolf (1955) to estimate the risk of contracting either peptic 
ulcers or stomach cancer for individuals of particular ABO phenotypes. Since then it has been 
used to calculate risks for genetic markers associated with many diseases and its most notable 
use has been in studying several HLA-associated diseases such as insulin dependent diabetes 
mellitus (IDDM), coeliac disease, multiple sclerosis and ankylosing spondylitis. Several
assumptions are generally made about the underlying population from which both the disease sample
and the control sample are obtained, most importantly that both samples are drawn from the
same genetically homogeneous population in an unbiased way. By this we mean that the disease
sample should be selected on a clear-cut ascertainment criterion, e.g. randomly chosen affected
individuals with no bias pertaining to other factors, and the control sample should be a strictly 
random sample from the same genetic population. In practice, this latter criterion is rather 
difficult to fulfil and most often the control is created from conveniently available data drawn 
from a population thought to be somewhat closely related to that from which the disease sample 
was drawn. 
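The statement that the cross product approximates the relative risk when the disease is rare can be checked numerically. The sketch below is illustrative only: the function names, the trait frequency and the two risks are hypothetical, and the control group is assumed to consist of unaffected individuals.

```python
from typing import Tuple


def conditional_trait_frequencies(
    trait_freq: float, risk_with: float, risk_without: float
) -> Tuple[float, float]:
    """Return Pr(Q | affected) and Pr(Q | unaffected) by Bayes' rule."""
    affected = trait_freq * risk_with + (1.0 - trait_freq) * risk_without
    pr_q_aff = trait_freq * risk_with / affected
    pr_q_unaff = trait_freq * (1.0 - risk_with) / (1.0 - affected)
    return pr_q_aff, pr_q_unaff


def cross_product_ratio(pr_q_aff: float, pr_q_control: float) -> float:
    """Pr(Q|aff) Pr(q|control) / (Pr(q|aff) Pr(Q|control)), where q means 'not Q'."""
    return (pr_q_aff * (1.0 - pr_q_control)) / ((1.0 - pr_q_aff) * pr_q_control)


# Hypothetical rare disease: risk 0.001 with the trait, 0.0001 without (true RR = 10).
pr_aff, pr_ctrl = conditional_trait_frequencies(0.5, 0.001, 0.0001)
approx_rr = cross_product_ratio(pr_aff, pr_ctrl)
true_rr = 0.001 / 0.0001
```

With these assumed values the cross product exceeds the true relative risk by well under 1%, and the gap widens as the disease becomes more common.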

Several years ago we proposed (Rubinstein et al. 1981) an alternative method for obtaining
the control sample for relative risk (RR) estimations that eliminated the problems of sampling 
from a single homogeneous population. This method used, as a control, those parental 
haplotypes not present in the affected child and was therefore called the haplotype relative risk 
(HRR). This method has several appealing features including freedom from collection of proper 
control samples. Additionally, where families are to be studied anyway, collection of the family 
data automatically includes collection of the necessary control sample. It is, however, necessary 
to demonstrate that the HRR estimate has the appropriate characteristics. In this paper we 
will show that, assuming the 'ideal' conditions inherent in the definition of RR, namely, control
and disease samples both randomly chosen from the same homogeneous random mating
population, the expected value of the HRR is identical to that of the conventional RR. We
will then illustrate its use in the estimation of risks for HLA antigens and phenotypes associated
with IDDM.

THE MODEL 

Consider a set of families that has been ascertained through a single affected child, where the 
relevant disease locus is closely linked to a normal polymorphic genetic marker (e.g. HLA) and 
where certain alleles (antigens) are associated with the disease. For purposes of concreteness, 
we will assume that the disease is recessively inherited, although the same arguments hold for 
dominance and for other inheritance models as well. Assume that the HLA haplotypes present 
in the parents can be followed unambiguously in transmission to the offspring and designate
the two inherited by the affected child as 'a' (paternal) and 'c' (maternal). Thus haplotypes
a and c are assumed to carry the disease allele, say 'n'. In the special case where the child
as well as both parents are ac, it is not certain whether the child gets the a from the mother
or the father. However, it is still known that one a and one c haplotype were transmitted to
the affected child, and thus carry the n allele, and that the haplotypes not passed on to the
affected child were also a and c. The latter can therefore be included in the 'random sample'
as described below. Now if we have truly obtained our sample as a random, singly selected
sample, the two parental haplotypes not transmitted to the affected child (say b and d) will
represent a random sample of haplotypes from the population at large and will thus carry the
disease allele (n) or the normal allele (N) with probabilities equal to the allele frequencies in
the population (say $p_1$ and $p_2$, respectively, $p_1 + p_2 = 1$). The validity of this observation requires
compliance with certain other assumptions including (1) that the parents are not inbred, (2) 
that there is no correlation within or between parental phenotypes and (3) that there is no 
differential fertility of the disease phenotypes. 

Now assume that an antigen 'Q' at the HLA locus is in positive linkage disequilibrium with
n, the disease allele. We wish to calculate the relative risk to carriers of Q of contracting the
disease. We will use as our control population the set of 'b' and 'd' haplotypes from our sample
of disease families (that is, those haplotypes within a family not carried by the single affected
proband). Using this control we will then calculate the conventional cross product odds ratio
given above to obtain the haplotype relative risk (HRR). Define the relevant population
frequencies as follows:

\[
\begin{aligned}
f(Q) &= q_1, \\
f(q) &= q_2 = 1 - q_1 \quad \text{(where $q$ represents all other alleles)}, \\
f(n) &= p_1, \\
f(N) &= p_2 = 1 - p_1, \\
f(Qn) &= x_1 = p_1 q_1 + \delta, \\
f(QN) &= x_2 = p_2 q_1 - \delta, \\
f(qn) &= x_3 = p_1 q_2 - \delta, \\
f(qN) &= x_4 = p_2 q_2 + \delta,
\end{aligned}
\]

where $\delta$ is the measure of disequilibrium between n and Q.

We now need the four conditional probabilities necessary for the odds ratio. For the affected 
sample these are the same, regardless of how we choose our control. 

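\[ \Pr(Q \mid \text{aff}) = 1 - \frac{x_3^2}{p_1^2} = \frac{p_1^2 - x_3^2}{p_1^2}, \qquad \Pr(\text{not } Q \mid \text{aff}) = \frac{x_3^2}{p_1^2}. \]

Now since the control haplotypes will be a random sample from the population, the conditional
probabilities will be:

\[ \Pr(Q \mid \text{control}) = 1 - (x_3 + x_4)^2 = 1 - q_2^2, \qquad \Pr(\text{not } Q \mid \text{control}) = (x_3 + x_4)^2 = q_2^2. \]

Thus the estimate of the HRR is:

\[ \text{HRR} = \frac{\Pr(Q \mid \text{aff})\,\Pr(\text{not } Q \mid \text{control})}{\Pr(\text{not } Q \mid \text{aff})\,\Pr(Q \mid \text{control})} = \frac{(p_1^2 - x_3^2)\, q_2^2}{x_3^2\,(1 - q_2^2)}, \]

which is identical to the equivalent expression for the conventional RR.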
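The derivation can be checked numerically with a short Python sketch. It assumes the recessive model and the haplotype frequencies defined above; the function name and the parameter values are hypothetical and serve only to show that the step-by-step conditional probabilities reproduce the closed-form expression.

```python
def expected_hrr(p1: float, q1: float, delta: float) -> float:
    """Expected HRR for antigen Q under the recessive model of the text.

    p1: frequency of disease allele n; q1: frequency of antigen Q;
    delta: disequilibrium between n and Q.
    """
    p2, q2 = 1.0 - p1, 1.0 - q1
    x1 = p1 * q1 + delta  # f(Qn)
    x2 = p2 * q1 - delta  # f(QN)
    x3 = p1 * q2 - delta  # f(qn)
    x4 = p2 * q2 + delta  # f(qN)
    if x3 <= 0 or min(x1, x2, x4) < 0:
        raise ValueError("parameters give an invalid or degenerate haplotype frequency")
    pr_not_q_aff = (x3 / p1) ** 2        # both transmitted n-haplotypes lack Q
    pr_q_aff = 1.0 - pr_not_q_aff
    pr_not_q_control = (x3 + x4) ** 2    # both random haplotypes lack Q (= q2 squared)
    pr_q_control = 1.0 - pr_not_q_control
    return (pr_q_aff * pr_not_q_control) / (pr_not_q_aff * pr_q_control)


# Hypothetical parameter values, chosen only for illustration.
hrr_associated = expected_hrr(p1=0.1, q1=0.2, delta=0.02)
hrr_no_association = expected_hrr(p1=0.1, q1=0.2, delta=0.0)
```

With no disequilibrium the expected HRR is 1, and positive disequilibrium between n and Q pushes it above 1.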

EXAMPLE 

Using data collected for the 9th HLA Workshop (Bertrams & Baur, 1984) we looked at the 
sample of families, submitted for study, where a single child was affected with IDDM and where 
the ethnic background was caucasoid (Western European or North American). The patients

Table 1. DR phenotypes of IDDM disease sample, simplex cases

DR type     No. obs.    No. exp.
DR3, 3          6          7.8
DR3, 4         25         18.2
DR4, 4          4         10.7
DR3, X         16         19.1
DR4, X         29         22.4
DRX, X         10         11.7
Total          90         89.9

p(DR3) = 0.294; p(DR4) = 0.344; p(DRX) = 0.361; α/β = (0.278)/(0.202) = 1.38.
Table 2. DR phenotypes of control sample consisting of non-affected parental haplotypes

DR type     No. obs.    No. exp.
DR3, 3          0          0.77
DR3, 4          2          1.23
DR4, 4          0          0.49
DR3, X         13         12.26
DR4, X         10          9.76
DRX, X         48         48.49
Total          73         73.00

p(DR3) = 0.103; p(DR4) = 0.082; p(DRX) = 0.815; χ² = 1.79, 2 d.f.

were categorized with respect to their HLA DR phenotypes using three distinct allelic groups
DR3, DR4, and DRX, where DRX represents all other DR antigens except DR3 and DR4. The
results are shown in Table 1 with estimated allele frequencies and observed and 'Hardy-Weinberg
expected' numbers for each phenotypic class. The α/β ratio of Falk et al. (1983) was
also calculated and found to be 1.38. This ratio relates the observed frequency (α) of, say, the
DR3, 4 phenotype, to the Hardy-Weinberg expected frequency [β = 2p(DR3)p(DR4)] in a
sample of diseased individuals (Table 1). A value in excess of 1.0 is an indication that the
associated susceptibility locus does not show a simple dominant or recessive mode of inheritance
with a single susceptibility allele. The value of 1.38 found here is characteristic of samples of
IDDM individuals where an excess of DR3, 4's is often observed, thus suggesting a more complex
mode of inheritance for susceptibility (Falk, 1984). The 'control group' was made up of the
parental haplotype pairs not present in the affected child (only families in which all four HLA
haplotypes could be followed were used). There were 146 parental control haplotypes. The allele
frequencies for DR3, DR4, and DRX in this group were 0.103, 0.082, and 0.815 respectively.
These values agree remarkably well with the total frequencies obtained for the 'random mating
population' comprising all caucasoid random individuals submitted to the 9th HLA Workshop
(Baur et al. 1984) (see, e.g., the table on page 694, where the DR marginal frequencies are 0.122,
0.129, and 0.749 for the same three DR alleles). If the control haplotypes from each family
are assumed to be a 'control individual', we obtain a control population sample of 73 which
is in H-W equilibrium (χ² = 1.79, 2 d.f., see Table 2).
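The Hardy-Weinberg expected numbers and the goodness-of-fit statistic quoted for the parental control sample can be recomputed with a short Python sketch. The function names are hypothetical; the allele frequencies and observed counts are those printed for Table 2. Because the published frequencies are rounded to three decimals, recomputed values may differ slightly in the last digit.

```python
from itertools import combinations_with_replacement
from typing import Dict, Tuple

Phenotype = Tuple[str, str]


def hardy_weinberg_expected(freqs: Dict[str, float], n: float) -> Dict[Phenotype, float]:
    """Expected counts: n * p^2 for homozygotes, n * 2pq for heterozygotes."""
    expected = {}
    for a, b in combinations_with_replacement(list(freqs), 2):
        prob = freqs[a] ** 2 if a == b else 2.0 * freqs[a] * freqs[b]
        expected[(a, b)] = n * prob
    return expected


def goodness_of_fit_chi2(
    observed: Dict[Phenotype, float], expected: Dict[Phenotype, float]
) -> float:
    """Sum of (observed - expected)^2 / expected over phenotype classes."""
    return sum((observed[k] - expected[k]) ** 2 / expected[k] for k in expected)


# Values as printed for Table 2 (control sample of 73 'individuals').
control_freqs = {"DR3": 0.103, "DR4": 0.082, "DRX": 0.815}
control_observed = {
    ("DR3", "DR3"): 0, ("DR3", "DR4"): 2, ("DR3", "DRX"): 13,
    ("DR4", "DR4"): 0, ("DR4", "DRX"): 10, ("DRX", "DRX"): 48,
}
control_expected = hardy_weinberg_expected(control_freqs, 73)
control_chi2 = goodness_of_fit_chi2(control_observed, control_expected)
```

The same helper applied to the disease-sample frequencies gives the β term of the α/β ratio, 2p(DR3)p(DR4), when n is set to 1.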

In Table 3, we compare the HRR's for DR3 and DR4 to the RR's calculated using a 
'contrived control population' from the 9th HLA Workshop population data referred to above.
This 'population' is assumed to be in H-W equilibrium and our 'random sample' is of the same

Table 3. HRRs and RRs for the DR3 and DR4 antigens in a sample of simplex IDDM patients

(The control for the HRR's is the sample of parental haplotypes not present in the affected individuals.
The control for the RR's was obtained by 'creating' a H-W sample assuming the antigen frequencies
recorded for the 9th HLA workshop (Baur et al. 1984).)

HRR
                    DR3                          DR4
              +      -    Total           +      -    Total
Disease      47     43      90           58     32      90
Control      15     58      73           12     61      73
Total        62    101     163           70     93     163
         HRR = 4.23                  HRR = 9.21
         p = 2.6 x 10^-5             p = 7.6 x 10^-10

RR
                    DR3                          DR4
              +      -    Total           +      -    Total
Disease      47     43      90           58     32      90
Control      21     69      90           22     68      90
Total        68    112     180           80    100     180
         RR = 3.59                   RR = 5.60
         p = 5.3 x 10^-5             p = 6.8 x 10^-8

Table 4. HRRs and RRs for the DR3, 3, DR3, 4 and DR4, 4 phenotypes

(Samples are the same as those described in Table 3. In each case comparison is made relative to the
'base group' DRX, X to avoid the problems of non-independent risk estimates.)

DR type     Disease sample    Parental control    Workshop control
DR3, 3             6                  0                  1.3
DR3, 4            25                  2                  2.8
DR4, 4             4                  0                  1.5
DR3, X            16                 13                 16.4
DR4, X            29                 10                 17.4
DRX, X            10                 48                 50.5
Total             90                 73                 89.9

HRR                          RR
HRR(3, 4) = 60.0             RR(3, 4) = 45.1
HRR(3, 3) = undefined        RR(3, 3) = 23.3
HRR(4, 4) = undefined        RR(4, 4) = 13.5

If 'expected values' are substituted for the zero observations in the parental control, one gets:

HRR'(3, 3) = 37.4.
HRR'(4, 4) = 39.2.

size as our disease sample (i.e. 90 individuals). Table 4 gives HRR's and RR's for the three DR 
phenotypes DR3, 3, DR3, 4, and DR4, 4 using the same samples. Here the risks are compared 
to the baseline phenotype DRX, X in each case since the risks are not independent (cf. 
Curie-Cohen, 1981, Svejgaard & Ryder, 1981). Note that the HRR's for DR3, 3 and DR4, 4 
are undefined since there are no 'individuals' with those phenotypes in the control sample of
73. If expected values are substituted for the 'zero' values in those cases HRR's can be estimated
as given at the bottom of Table 4, but the use of such estimates must be made with caution. 
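The risk estimates in Tables 3 and 4 follow from the cross product of a 2 x 2 table. The Python sketch below recomputes a few of them from the counts given in the tables; the function name is hypothetical, and returning None for a zero denominator is an illustrative choice that mirrors the 'undefined' entries.

```python
from typing import Optional


def cross_product_odds_ratio(
    case_pos: float, case_neg: float, control_pos: float, control_neg: float
) -> Optional[float]:
    """(case_pos * control_neg) / (case_neg * control_pos); None when undefined."""
    denominator = case_neg * control_pos
    if denominator == 0:
        return None
    return case_pos * control_neg / denominator


# Antigen-level HRRs (Table 3): disease sample versus parental control.
hrr_dr3 = cross_product_odds_ratio(47, 43, 15, 58)
hrr_dr4 = cross_product_odds_ratio(58, 32, 12, 61)

# Phenotype-level HRRs (Table 4), each relative to the DRX, X base group.
hrr_dr3_4 = cross_product_odds_ratio(25, 10, 2, 48)
hrr_dr3_3 = cross_product_odds_ratio(6, 10, 0, 48)  # no DR3, 3 controls: undefined
```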



DISCUSSION

One of the major problems inherent in proper calculations of relative risks (RR's) is that of 
choosing an appropriate control. A basic assumption in the use of RR's is that both the affected 
sample and the control sample are chosen at random from the same genetically homogeneous 
random mating population with no selection criteria except for the disease status required for 
inclusion in the affected sample. In practice this is a difficult criterion to fulfil. Additionally, 
it adds a significant amount of work to select and test such a control sample. It is therefore
often assumed that the control sample is simply a hypothetical sample created from a population 
thought to be similar to that of the disease sample and 'generated' from that population by 
assuming H-W equilibrium and some reasonable sample size (cf. Svejgaard & Ryder, 1981, and 
our 'contrived' sample of the previous section).

Given the known heterogeneity of current urban populations, even within the less hetero- 
geneous European countries, use of population control data culled, for example, from HLA 
workshop surveys, may alter the significance of calculated RR's. Although, in the examples 
given here the results are significant for both RR's and HRR's (Table 3), the 'p values' for
significance differ by two-fold (for DR3) and 100-fold (for DR4), with the HRR's being more
significant in each case. If less extreme samples were tested, careless choice of the control group
could very well make the difference between statistical significance and non-significance
(resulting in either a type I or a type II error).

Methods have previously been proposed for using sibship information to calculate 'risks'. For 
example, Clarke (1961) describes a method, attributed to C. A. B. Smith, for using sibships to
test for a significant risk of duodenal ulcers to individuals of blood group O. The method used 
is somewhat different from that described here in that an observed and expected probability 
of being group O is assigned to the propositus in each sibship where the expected value depends 
on the makeup of the sibship. The significance is then based on a comparison of pooled observed 
and expected values over a set of sibships. This method does overcome the problem of 
heterogeneity but, because of the way the test is constructed, only a small part of the data can
be used. In Clarke's example, therefore, the associations found when using the general 
population as a control were very much decreased when using Smith's sibship method. This does 
not seem to be the case using HRR's where the associations remain strong. 

By using the two parental haplotypes not present in the single diseased individuals of the 
disease sample as the 'control sample', we are assured of having both samples from the same
genetic population and, as was demonstrated above, this sample should represent a random
sample of haplotype pairs (or 'individuals') from that population. Care must still be taken to
ensure that the population chosen is genetically homogeneous, to the extent possible, but the 
task of obtaining an appropriate control is simplified. 

If the disease is dominant rather than recessive, the HRR can still be used in the same way. 
Although it is not known whether the disease allele is present on the paternal haplotype ('a')
or the maternal ('c') or perhaps on both, the other two parental haplotypes, b and d, will still
represent random haplotypes from the underlying population, provided that the conditions 
mentioned for the recessive case obtain. 

If a family is selected through more than one affected child, the situation is somewhat
different. If the two affected sibs share the same two HLA haplotypes then the other two should
still represent random haplotypes from the population. However, if they share fewer than two
haplotypes, the situation is more complicated. Now three (or possibly four) haplotypes are 
known to carry the disease allele in the recessive case. If the disease is dominant, it is possible, 
but not certain, that a single shared haplotype carries the disease allele. If no haplotype is 
shared, it is not possible to define disease-carrying haplotypes with certainty. In such cases it 
would therefore be difficult to define a control sample of random haplotypes meeting the 
necessary criteria. 

Two other points should be emphasized. If there is differential selection between genotypes 
at the susceptibility locus, (e.g. reduced fertility) a bias might be introduced such that the 
control haplotypes could no longer be considered a random population sample. Thus we require 
compliance with assumption (3) of our model to ensure the proper distribution of susceptibility 
alleles in the 'control' haplotypes. 

Further, if the population from which the sample is drawn is genetically heterogeneous with 
respect to the disease, the HRR as well as the RR may be difficult to interpret as well as to 
use. In an extreme case a population might be made up of two ethnically distinct subpopulations 
that do not interbreed. Assume that the disease of interest occurs in only one of two such 
subpopulations. An estimate of the HRR would come entirely from a sample taken from the 
subpopulation where the disease is present and would be relevant only to that population
(individuals in the other group having no risk, by definition). On the other hand, the RR would 
assign a risk over the entire population that would be too low for individuals in the susceptible 
part of the population and too high for individuals in the non-susceptible part. 

We wish to thank Drs Jurg Ott, Neil Risch and C. A. B. Smith for helpful and constructive comments on 
an earlier draft of this paper. 
This work was supported by NIH grant GM29177.

Summary

The role and limitations of retrospective investigations of factors possibly 
associated with the occurrence of a disease are discussed and their 
relationship to forward-type studies emphasized. Examples of situations 
in which misleading associations could arise through the use of inappropri- 
ate control groups are presented. The possibility of misleading associa- 
tions may be minimized by controlling or matching on factors which 
could produce such associations; the statistical analysis will then be 
modified. Statistical methodology is presented for analyzing retro- 
spective study data, including chi-square measures of statistical signifi- 
cance of the observed association between the disease and the factor 
under study, and measures for interpreting the association in terms of an 
increased relative risk of disease. An extension of the chi-square test 
to the situation where data are subclassified by factors controlled in the
analysis is given. A summary relative risk formula, R, is presented and
discussed in connection with the problem of weighting the individual
subcategory relative risks according to their importance or their precision.
Alternative relative-risk formulas, R1, R2, R3, and R4, which require the
calculation of subcategory-adjusted proportions of the study factor
among diseased persons and controls for the computation of relative
risks, are discussed. While these latter formulas may be useful in many
instances, they may be biased or inconsistent and are not, in fact,
averages of the relative risks observed in the separate subcategories. Only
the relative-risk formula, R, of those presented, can be viewed as such an
average. The relationship of the matched-sample method to the
subclassification approach is indicated. The statistical methodology
presented is illustrated with examples from a study of women with epidermoid
and undifferentiated pulmonary carcinoma. J. Nat. Cancer Inst. 22: 719-748,
1959.



Introduction 

A retrospective study of disease occurrence may be defined as one in 
which the determination of association of a disease with some factor is 
based on an unusually high or low frequency of that factor among diseased 
persons. This contrasts with a forward study in which one looks instead 

Received for publication November 6, 1958.

National Institutes of Health, Public Health Service, U.S. Department of Health, Education, and Welfare.
for an unusually high or low occurrence of the disease among individuals
possessing the factor in question. Each approach has its advantages. 
Among the desirable attributes of the retrospective study is the ability to 
yield results from presently collectible data, whereas the forward study 
usually requires future observation of individuals over an extended period 
(this is not always true; if the status of individuals can be determined 
as of some past date, the data for a forward study may already be at 
hand). The retrospective approach is also adapted to the limited resources
of an individual investigator and places a premium on the formulation of 
hypotheses for testing, rather than on facilities for data collection. For 
especially rare diseases, a retrospective study may be the only feasible
approach, since the forward study may prove too expensive to consider 
and the study size required to obtain a respectable number of cases 
completely unmanageable. 

In the absence of important biases in the study setting, the retrospec- 
tive method could be regarded, according to sound statistical theory, as 
the study method of choice. This follows from the much reduced sample 
sizes required by this approach and may be illustrated by the following 
extreme example. If a disease attack rate of 10 per 100,000 among 50 
percent of the population free of some factor were increased tenfold among 
the other half of the population subject to the factor, a retrospective study 
of 100 cases and 100 controls would, with high probability, reveal this 
significantly increased risk. On the other hand, a forward study cover- 
ing 2,000 persons, half with and half without the factor, would almost 
certainly fail to detect a significant difference. For comparable ability 
to find the type of increased risk just indicated, a forward study would 
need to cover about 500 times as many individuals as the corresponding 
retrospective study. The disparity in the required number of persons to 
be studied could, of course, be reduced by lengthening the follow-up period 
for forward studies to increase the experience in terms of person-years 
observed. The larger sample size required for the forward study reflects 
principally the infrequent occurrence of the disease entity under investiga- 
tion. In the example illustrated, uncovering 100 cases of disease in a for- 
ward study would require either 100,000 individuals with the factor or 
1,000,000 without. For diseases with a higher probability of occurrence 
the disparity in required size between retrospective and forward studies 
would be progressively reduced. 
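The arithmetic behind this example can be laid out in a short Python sketch. It uses only the rates stated in the text (10 per 100,000 without the factor, tenfold higher with it); the function names are hypothetical, and exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction

# Attack rates from the example in the text.
rate_without = Fraction(10, 100_000)
rate_with = 10 * rate_without


def persons_needed(target_cases: int, attack_rate: Fraction) -> Fraction:
    """Number of persons to follow so that the expected case count equals target_cases."""
    return Fraction(target_cases) / attack_rate


def expected_cases(n_with: int, n_without: int) -> Fraction:
    """Expected number of cases in a forward study of the given composition."""
    return n_with * rate_with + n_without * rate_without


needed_with = persons_needed(100, rate_with)        # persons with the factor
needed_without = persons_needed(100, rate_without)  # persons without the factor
cases_in_2000 = expected_cases(1_000, 1_000)        # forward study of 2,000 persons
```

Under these rates, a forward study of 2,000 persons split evenly between the two groups expects only about one case in total, which is why it would almost certainly fail to detect the difference.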

The retrospective study might be looked upon as a natural extension 
of the practice of physicians since the time of Hippocrates, to take case 
histories as an aid to diagnosis. Its guise has varied with respect to the 
means of measuring the prevalence of the suspect factor among diseased 
persons and the criteria for determining unusual departures from normal 
experience. When an association is so marked, as in Percival Pott's 
observations on the representation of chimney sweeps among cases of 
scrotal cancer, no further quantitative data are required to perceive its 
significance. 

The retrospective approach has often been employed in studies of
communicable diseases, one illustration being Snow's observations (1) on a
common water supply for cholera cases in an area served by several sources 
(there would have been no element of unusualness had there been but one 
water supply). When a disease is epidemic in a circumscribed locality, 
the disease-free population in the same area offers a natural contrast. The 
method may be used successfully for endemic diseases as well. Holmes, 
in reaching his conclusions on the communicable nature of puerperal fever
(2), noted particularly that a large number of women with puerperal fever 
had been attended by the same physicians. In this context it should be 
emphasized that communicable disease investigations have often com- 
bined retrospective and forward study methods. For example, Snow 
supplemented his retrospective observations on water supply by a contrast
of cholera rates among subscribers of the Southwark and Vauxhall
water company with the experience of persons served by the Lambeth 
water company within the same area. 

When a disease occurs sporadically, or its occurrence is not confined to 
a well-defined group (such as women at childbirth), a choice of controls 
is not immediately evident. For cancer and other diseases characterized 
by high fatality rates, a study restricted to decedents might use persons 
dying from other causes as controls. Rigoni Stern adopted this tech- 
nique in deducing the relationship of cancer of the breast and of the 
uterus to pregnancy history (3). Some contemporary studies have also
used deaths from other causes as controls (4, 5).

The present-day controlled retrospective studies of cancer date from 
the Lane-Claypon paper on breast cancer published in 1926 (6). This 
report is significant in setting forth procedures for selecting matched 
hospital controls and relating them to a consideration of study objectives. 
Retrospective techniques have since been applied in several investigations 
of cancer, including the following partial list of current references for a 
few primary sites: bladder (7-10), breast (11-13), cervix (13-16), larynx
(17, 18), leukemia (19), lung (18, 20-27), and stomach (18, 28-30).

Statisticians have been somewhat reluctant to discuss the analysis of 
data gathered by retrospective techniques, possibly because their train- 
ing emphasizes the importance of defining a universe and specifying rules 
for counting events or drawing samples possessing certain properties. 
To them, proceeding from "effect to cause," with its consequent lack of 
specificity of a study population at risk, seems an unnatural approach. 
Certainly, the retrospective study raises some questions concerning the 
representative nature of the cases and controls in a given situation which 
cannot be completely satisfied by internal examination of any single set 
of data. 

Only a few published papers have treated the statistical aspects of 
retrospective studies. Cornfield discussed the problem in terms of esti- 
mated measures of relative and absolute risks arising from contrasts of 
persons with and without specified characteristics (31). His paper was 
concerned with the simple situation of a homogeneous population of cases 
and controls, presumably alike in all characteristics except the one under 

Vol. 22, No. 4, April 1959 
investigation, which could be represented by a single contingency table. 
In a later contribution he handled the problem of controlling for other 
variables by adjusting the distribution of controls to the observed dis- 
tribution of cases (16). Dorn briefly mentions retrospective studies with 
emphasis on such topics as sources of data, choice of controls, and validity 
of inferences (32). 

This paper presents a method for computing relative risks for retro- 
spective study contrasts, which controls for the effects of other variables 
by use of the basic statistical principle of subclassification of data. The 
related problem of significance testing is also considered. Since details 
of statistical treatment are conditioned by study objectives, data collec- 
tion methods, choice of a control series, and the use of matched or un- 
matched controls, these topics are also discussed briefly. 

Objectives 

Retrospective studies are relatively inexpensive and can play a valuable 
role as scouting forays to uncover leads on hitherto unknown effects, 
which can then be explored further by other techniques. The effects may 
be novel and not suggested by existing data, as in the pioneer work on the 
association of smoking and lung cancer or the association of blood type 
and gastric cancer, or they may represent refinements of current know- 
ledge. The latter category might include collection of lifetime residence 
and/or work histories to elaborate differences in incidence and mortality 
which appear when some diseases are classified by last place of residence 
or last occupation of the newly diagnosed case or decedent. 

With diseases of low incidence the controlled retrospective study may 
be the only feasible approach. Here emphasis should be placed on 
assembling results from several studies. Before accepting a finding and 
offering an interpretation, scientific caution calls for ascertaining whether 
it can be reproduced by others and in other administrative settings having 
their own peculiar biases. 

A primary goal is to reach the same conclusions in a retrospective study 
as would have been obtained from a forward study, if one had been done. Even 
when observations for a forward study have been collected, a supple- 
mentary retrospective approach to the same body of material may prove 
useful in collecting more data on points not covered in the original study 
design or in amplifying suggestive associations appearing in the initial 
forward-study results. 

The findings of a retrospective study are necessarily in the form of 
statements about associations between diseases and factors, rather than 
about cause and effect relationships. This is due to the inability of the 
retrospective study to distinguish among the possible forms of associa- 
tion — cause and effect, association due to common causes, etc. Similar 
difficulties of interpretation arise in forward studies as well. A forward 
study, to avoid these difficulties, would need to be performed with the 
preciseness of a laboratory experiment. For example, such a study of 
associations with cigarette smoking would require that an investigator 
randomly assign his subjects in advance to the various smoking categories, 
rather than simply note the categories to which they belong. The 
inherent practical difficulties of such an enterprise are evident. 

In addition to the failings shared with the forward study, the retro- 
spective study is further exposed to misleading associations arising from 
the circumstances under which test and control subjects are obtained. 
The retrospective study picks up factors associated with becoming a 
diseased or a disease-free subject, rather than simply factors associated 
with presence or absence of the disease. The difficulties in this regard 
may be most pronounced when the study group represents a cross section 
of patients alive at any time (prevalence), including some who have been 
ill for a long period. Inclusion of the latter may lead to identification 
of items associated with the course of the illness, unrelated to increased 
or decreased risk of developing the disease. The theoretical point has 
been raised that factors conducive to longer survival of patients may be 
found in "prevalence" samples and interpreted erroneously as being 
associated with excess liability to the disease (33). Loopholes of this 
type are minimized when investigations are restricted to samples of 
newly diagnosed patients (incidence). 

A partial remedy for these uncertainties lies in employing a conserva- 
tive approach to interpretation of the associations observed. Recognizing 
the ease with which associations may be influenced by extraneous factors, 
the investigator may require not only that the measure of relative risk 
be significantly different from unity but also that it be importantly 
different. He may, for instance, require that the data indicate an 
increased relative risk for a characteristic of at least 50 percent, on the 
assumption that an excess of this magnitude would not arise from extrane- 
ous factors alone. However, the use of such conservative procedures 
emphasizes a corresponding need to pinpoint the disease entity under 
study. A strong relationship between a factor and a disease entity 
might fail to be revealed, if the entity was included in a larger, less well- 
defined, disease category. After the event from data now at hand, 
we know that a study of the association of cigarette smoking with epider- 
moid and undifferentiated pulmonary carcinoma is more revealing 
than an inquiry covering all histologic types of lung cancer. 

Multiple Comparison Problem 

The present-day retrospective study is usually concerned with investi- 
gating a variety of associations with a disease, little effort being involved 
in acquiring, within limits, added information from respondents. The 
results may be analyzed in a number of ways: the various factors may 
be investigated separately, without regard to the other factors; they may 
be investigated in conjunction with each other, a particular conjunction 
being considered a factor in its own right; or, more commonly, a factor 
may be tested with control for the presence or absence of other factors. 
Thus, if the role of cigarette smoking and coffee drinking in a given 
disease are under study, the possible comparisons include the relative 
risk of disease for individuals who both smoke and drink coffee as opposed 
to all other persons, or as opposed to those who neither smoke nor drink 
coffee. In addition, the relative risk associated with smoking might be 
obtained separately for drinkers and nondrinkers of coffee, with a weighted 
average of these two relative risks constituting still another item. Con- 
versely, risks associated with coffee drinking, with adjustments for cigarette 
smoking, could be computed. 

The potential comparisons arising from a comprehensive retrospective 
study can be large. Almost any reasonable level of statistical significance 
used to test a single contrast, when applied to a long series of contrasts, 
will, with a high degree of probability, result in some contrasts testing 
significant, even in the absence of any real associations. The usual 
prescription for coping with this multiple comparison problem— requiring 
individual comparisons to test significant at an extreme probability level 
to reduce the number of associations incorrectly asserted to be true— 
would result only in making real associations difficult to detect. 

However, the multiple comparison problem exists only when inferences 
are to be drawn from a single set of data. If the purpose of the retro- 
spective study is to uncover leads for fuller investigation, it becomes 
clear there is no real multiple significance testing problem— a single 
retrospective study does not yield conclusions, only leads. Also, the 
problem does not exist when several retrospective and other type studies 
are at hand, since the inferences will be based on a collation of evidence, 
the degree of agreement and reproducibility among studies, and their 
consistency with other types of available evidence, and not on the 
findings of a single study. 

Nevertheless, it would be wise to employ testing procedures which do 
not lead to a superabundance of potential clues from any one study. 
This may be achieved by employing nominal significance levels in testing 
factors of primary interest incorporated into the design of an investigation 
and applying more stringent significance tests to comparisons of secondary 
interest or to comparisons suggested by the data. For the usual problem 
of multiple significance testing, this would be equivalent to allocating a 
large part of the desired risk of erroneous acceptance of an association as 
real to a small group of comparisons where fruitful results were anticipated, 
and parceling out the remainder of the available risk to the large bulk of 
comparisons of a more secondary nature. This minimizes the risk of 
diluting, through inclusion of many secondary comparisons, the chances 
for detecting an important primary effect. 
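The arithmetic behind the two preceding arguments can be sketched briefly. The following is a minimal illustration, not a procedure given in the text: it assumes the contrasts are statistically independent, and it divides the risk evenly within each group of comparisons in the manner of a Bonferroni split. The function names and the 80 percent share assigned to primary comparisons are hypothetical choices made only for the example.

```python
def familywise_error(alpha: float, k: int) -> float:
    """Chance that at least one of k independent contrasts tests
    significant at level alpha when no real association exists."""
    return 1.0 - (1.0 - alpha) ** k


def allocate_risk(total_risk: float, n_primary: int, n_secondary: int,
                  primary_share: float = 0.8) -> tuple[float, float]:
    """Split an overall risk of false assertion between a few primary
    comparisons and many secondary ones (assumed even split within
    each group; primary_share is an illustrative choice)."""
    per_primary = total_risk * primary_share / n_primary
    per_secondary = total_risk * (1.0 - primary_share) / n_secondary
    return per_primary, per_secondary


# Fifty contrasts each tested at the 5 percent level.
chance_of_some_false_lead = familywise_error(0.05, 50)

# An overall risk of 10 percent shared by 2 primary and 40 secondary contrasts.
primary_level, secondary_level = allocate_risk(0.10, 2, 40)
```

Under these assumptions the primary comparisons are tested at a level close to the nominal one, while each secondary comparison must meet a much more stringent level.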

Representative Nature of Data 

The fundamental assumption underlying the analysis of retrospective 
data is that the assembled cases and controls are representative of the 
universe defined for investigation. This obligates the investigator not 
only to examine the data which are the end product but also to go behind 
the scenes and evaluate the forces which have channeled the material to 
his attention, including such items as local practices of referral to special- 
ists and hospitals and the patient's condition and the effect of these items 
on the probability of diagnosis or hospital admission. We re-emphasize 
that this requires the exercise of judgment on the potential magnitude of 
biases and as to whether they could result in factors seeming to be related 
to a disease, in the absence of a real association of the factor with presence 
or absence of the disease. The danger of bias may be greatest in working 
with material from a single diagnostic source or institution. 

Among the more important practical considerations affecting retro- 
spective studies is that they are ordinarily designed to follow the line of 
least resistance in obtaining case and control histories. This means that 
cases and controls will often be hospital patients rather than persons in 
the general population outside hospitals. As a result, any factor which 
increases the probability that a diseased individual will be hospitalized 
for the disease may mistakenly be found to be associated with the disease. 
For example, Berkson (34) and White (35) have pointed out that positive 
association between two diseases, not present in the general population, 
may be produced when hospital admissions alone are studied, because 
persons with a combination of complaints are more likely to require 
hospital treatment. In theory, bias might also be produced in reverse 
manner, if the suspect factor diminished the probability of hospitalization 
for other diagnoses used as controls. The difficulties are not unique for 
hospital patients. Similar loopholes in interpretation may be advanced 
for any special groups used as sources of cases and controls. 

However, a mere catalogue of biases arising from the possibly un- 
representative nature of a sample of cases and controls should not ipso 
facto invalidate any study findings. This is a substantive issue to be 
resolved on its merits for a specific investigation. Collateral evidence 
may provide information on the potential magnitude of bias and the size 
of spurious associations which could result. In some situations the 
difference between cases and controls may be so great that postulation 
of an unreasonably large bias would be required. Whether he consciously 
recognizes it or not, the investigator must always balance the risks 
confronting him and decide whether it is more important to detect an 
effect, when present, or to reject findings, when they may not reflect the 
true situation. If opportunities for further testing exist, one should not 
be too hasty in rejecting an association as an artifact arising from the 
method of data collection, and in foreclosing exploration of a potentially 
fruitful lead. 

Because of the important role retrospective studies play in studies of 
human genetics, mention may be made of a bias frequently encountered 
in studies dealing with the familial distribution of diseases. A frequently 
used procedure takes a group of diagnosed cases for a disease in question 
and a group of controls and compares the prevalence of this disease among 
relatives of the probands and controls. The bias arises from the unrepre- 
sentative nature of the probands with respect to familial distribution and 
is known in other fields as "the problem of the index case" or "the effect 
of method of ascertainment." It has long been recognized that the 
characteristics for a random sample of families will differ from those for 
families to whom the investigator's attention has been directed because 
the family rosters include individuals selected for study on the basis of a 
specified attribute. For example, data on family size (number of 
children) obtained from siblings, rather than parents, are biased, since 
two or three potential index cases are present in the population for two- 
and three-child families as opposed to one for one-child families and none 
for childless couples. The analogy for disease occurrence is apparent. 
Families with two or three cases of the disease under study may have 
double or triple the probability of being represented by individuals in 
source material and having a representative selected as a proband than 
families with only one case. An appropriate analysis for this situation 
in studies of family size and birth order has been discussed by Greenwood 
and Yule (36), which takes account of the probability of family repre- 
sentation in proband data. Haenszel (37) has applied their correction 
to gastric-cancer data reported by Videbaek and Mosbech (38) and found 
the correction to reduce the originally reported fourfold excess of gastric 
cancer among relatives of probands, as compared to relatives of controls, 
to one of about 60 percent. 
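The family-size example can be made concrete with a small calculation. The sketch below is illustrative only: the distribution of family sizes is hypothetical, every child is assumed equally likely to serve as an index case, and the inverse weighting shown is a simple device for the example rather than the correction of Greenwood and Yule.

```python
from fractions import Fraction


def seen_through_children(family_dist):
    """Distribution of family size as reported by children, when a family
    with k children offers k potential index cases."""
    weights = {k: k * share for k, share in family_dist.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


def reweight_by_inverse_size(child_dist):
    """Undo the size bias by weighting each report by 1/k. Childless
    couples cannot be recovered, since they are never reported."""
    weights = {k: share / k for k, share in child_dist.items() if k > 0 and share > 0}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


def mean_size(dist):
    return sum(k * share for k, share in dist.items())


# Hypothetical population: equal shares of families with 0, 1, 2, and 3 children.
families = {k: Fraction(1, 4) for k in range(4)}
reported = seen_through_children(families)
recovered = reweight_by_inverse_size(reported)

true_mean = mean_size(families)       # mean over all families
reported_mean = mean_size(reported)   # mean as seen through the children
```

In this hypothetical population the three-child families supply half of all reports although they are only a quarter of all families, which is the same mechanism by which multiple-case families are overrepresented among probands.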

One remedy for the weakness of the retrospective approach to problems 
involving association of diseases and familial distribution would be to 
place greater reliance on forward observations of defined cohorts for 
data on these topics. 

Controls 

While easier accessibility to and lesser expense of hospital controls are 
important considerations, they should not deter one from collecting con- 
trol data for a sample representing a more general population, if the latter 
are demonstrably superior. Some of the uncertainties about the supe- 
riority of hospital or general population controls arise from the need to 
maintain comparability in responses. The dependence of retrospective 
studies on comparability of responses from cases and controls cannot be 
overemphasized. When more accurate answers can be obtained from 
controls in a medical-care environment, the gain in comparability of 
responses for these controls could outweigh the other advantages to be 
derived from the more representative nature of general population controls. 
The difficulties may be illustrated by the experience with smoking 
histories. Hospital controls invariably yield a higher proportion of 
smokers for each sex than controls of comparable age drawn from the 
general population (27). Does this mean more complete smoking histories 
are collected in hospitals or does it imply that smokers have higher hospital 
admission rates? If the first alternative is correct, hospital controls are 
the appropriate choice for measuring the association of smoking history 
with a given disease. The second alternative calls for general population 
controls and in this situation the use of hospital controls yields under- 
estimates of the degree of association. 

Dual hospital and general population controls would have some merit. 
If control data from the two sources were in agreement, this would rule 
out some alternative interpretations of the findings. In the event of dis- 
agreement, its extent could be measured and alternate calculations made 
on the degree of association between an event and a suspect antecedent 
characteristic. Where the two sets of controls lead to substantially dif- 
ferent results, a cautious and conservative interpretation is indicated. 

Some topics, such as those bearing on sex practices and use of alcohol, 
may be amenable to study only within a clinical setting, and the collec- 
tion of general population data on these items may prove impractical. 
The limitations of general population controls in this regard may have 
been overstressed, and empirical trials to test what information can be 
collected in household surveys should be encouraged instead of dismissing 
the possibility with no investigation whatsoever. Whelpton and Freed- 
man, for example, have reported some success in collecting histories of 
contraceptive practices in interviews of a random sample of housewives (39). 

When hospital controls are chosen, some precautions may be built into 
the study. Within limitations on the nature of controls imposed by a 
study hypothesis, controls drawn from a wide variety of diseases or ad- 
mission diagnoses should be preferred. This permits examination of the 
distribution of the study characteristics among subgroups to check on 
internal consistency or variation among controls. This affords protection 
against two sources of error: a) attributing an association to the disease 
under investigation, when the effect is really linked to the diagnosis from 
which controls were drawn, and b) failure to detect an effect because both 
the study and control diseases are associated with the suspect factor. 
The latter is far from impossible. Both tuberculosis and bronchitis have 
exhibited association with smoking history and the use of one disease or 
the other as a control could easily lead to missing the association with 
smoking history. Similarly, patients with coronary artery disease would 
not constitute suitable controls for a study of the relationship of smoking 
and bladder cancer and vice versa, since the investigator would probably 
conclude that smoking was not related to either disease, when in truth it 
appears related to both. When there is definite evidence that two diseases 
are associated, for example, pernicious anemia and stomach cancer, the 
use of one as a control for the other is contraindicated, unless the study is 
specially designed to elucidate some aspects of the relationship. 

It is always advantageous to include several items in a questionnaire 
for which general population data are available. This could be considered 
a partial substitute for dual hospital and general population controls. 
Disparity among cases, hospital controls, and general population controls 
on several general characteristics unrelated to the study hypothesis may 
be regarded as warning signals of the unrepresentative nature of the 
hospital cases and controls. 

Where possible, interviews should be conducted without knowledge 
of the identity of cases and controls to guard against interviewer bias, 
although administrative reasons will often prevent attainment of "blind" 
interviews. In cooperative studies employing several interviewers, the 
magnitude of interviewer bias may be diminished, since it is unlikely 
that all interviewers will share the same bias in concert. In special 
circumstances, such as those prevailing at Roswell Park Memorial 
Institute, admissions may be interviewed before diagnosis, and hence 
before the identity of cases and controls is established. This feature 
requires a comprehensive, general purpose interview routinely admin- 
istered to all admissions, which may restrict its use to publicly supported 
institutions diagnosing and treating neoplastic diseases or other specialized 
disease entities. Several epidemiological contributions for specific cancer 
sites have been based on the unique control data available from Roswell 
Park Memorial Institute (9, 11, 12, 30, 40-43), which are particularly 
valuable for collation with studies depending on more conventional 
sources of controls to evaluate interviewer bias and related issues. 

Some patients interviewed as diagnosed cases will subsequently have 
their diagnoses changed. This may be turned to advantage. If scrutiny 
of the data for the erroneously diagnosed group reveals they had histories 
resembling those for the control rather than the case series, as Doll and 
Hill found in their study of smoking and lung cancer (21), this would 
constitute evidence against interviewer bias. 

In investigations of a cancer site the association of a factor may often 
be restricted to a specific histologic type or a well-defined portion of an 
organ. The finding that epidermoid and undifferentiated pulmonary 
carcinoma is more strongly related to smoking history than adenocar- 
cinoma of the lung is now well established. The range of explanations 
for the observed deficit of epidermoid carcinoma of the cervix in Jewish 
women as compared to other white women is greatly circumscribed by 
the presence of about equal numbers of adenocarcinoma of the corpus in 
both groups. When these finer diagnostic details or their significance are 
unknown to the interviewer, another check on interviewer bias is provided. 
Furthermore, the confirmation in repeated studies of an association 
limited to a specific histologic type or a detailed site will lend credence 
to an etiological interpretation of the association. Repeated confirma- 
tion is an essential element. Otherwise, a very specific association may 
be a reflection of the multiple comparison problem; if enough contrasts 
are created by fractionation of a single set of data, some apparently 
significant result is likely to appear. For this reason it would be desirable 
to reproduce such provocative results as Wynder's finding that use of 
alcohol was more strongly associated with cancer of the extrinsic larynx 
than of the intrinsic larynx (18), and Billington's report that prepyloric 
and cardiac neoplasms of the stomach were associated with blood group 
A and those located in the fundus with blood group O (44). 

Discussion of matched controls in relation to the analysis and the 
computation of relative risks is deferred to a later section. One con- 
sideration on matched controls arising in the planning and development 
of a study should be mentioned here. Obviously, if the risk of disease 
changes with age an apparent association of the disease with other age- 
related factors may result. Other apparent associations with race, sex, 
nativity, etc., may arise in a similar manner. In devising rules for 
selecting controls, those factors known or strongly suspected to be related 
to disease occurrence should be taken into account if unbiased and more 
precise tests of the significance of the factors under investigation are 
desired. A sensible rule is to match those factors, such as age and sex, 
the effect of which may be conceded in advance and for which strong 
evidence is available from other sources, such as mortality data and 
morbidity surveys. When a factor is matched, however, it is eliminated 
as an independent study variable; it can be used only as a control on 
other factors. This suggests caution in the amount of matching attempted. 
If the effect of a factor is in doubt, the preferable strategy will be not to 
match but to control it in the statistical analysis. While the logical 
absurdity of attempting to measure an effect for a factor controlled by 
matching must be obvious, it is surprising how often investigators must 
be restrained from attempting this. 

When a minimum of matching is involved, the importance of estab- 
lishing, precisely and in advance, the method by which controls are 
selected for study increases. The rule should be rigid and unambiguous 
to avoid creating effects by subconscious selection and manipulation of 
controls. The problem is similar to that encountered in therapeutic 
trials where a protocol spelling out all the contingencies and actions to 
be taken in advance is, along with random assignment of cases and con- 
trols, the major bulwark against bias. 

To reduce interview time and expense there are advantages in pro- 
cedures for selecting controls which permit a case and the corresponding 
controls to be interviewed in a single session, particularly if travel to 
several institutions is involved. In practice, this favors selecting controls 
from a hospital patient census rather than from hospital admission lists. 
The difficulty with hospital admissions is that there is no guarantee that 
the controls will be available in the hospital at the time the diagnosed 
case is interviewed. This point seems more important than the fact 
that patients with diagnoses requiring long-term stays are overrepresented 
in a current hospital census (45). If the latter is an important issue, it 
may be handled in analysis through subclassification of controls by 
diagnosis. 

Normally there will be little difficulty in reconciling these considera- 
tions into a harmonious set of rules. The items to be matched often 
lend themselves to a procedure for specifying controls. In a recent 
study on female lung cancer we found that the definition of two controls 
as the next older and the next younger women in the same hospital 
service, present on the day the case was interviewed, met the requirements 
just outlined (27). The controls were uniquely defined, the records 
establishing their identity were readily available on the service floor, 
interviews could be completed in one day, and a provision for balancing 
ages of cases and controls was incorporated. Simultaneous interviews 
of cases and controls may be more than an administrative convenience. 
If the prevalence of the associated factor is rapidly shifting over time, 
failure to control time of interview could obscure or exaggerate an 
association. 

Some Statistical Tools 

To progress further, questions on the representative nature of the case 
and control series must have been resolved affirmatively. With this 
condition in mind, let us suppose that a controlled retrospective study 
has been conducted and that the number of diseased cases, $N_1$, consists 
of $A$ individuals with the factor being investigated and $B$ free of the 
factor, while the number of controls, $N_2$, consists of $C$ individuals with, 
and $D$ individuals without the factor. Let $M_1 = A + C$, $M_2 = B + D$, 
$T = N_1 + N_2 = M_1 + M_2 = A + B + C + D$. What statistical evi- 
dence is there for the presence of an association and what is an appro- 
priate measure of the strength of the association? 

A commonly employed statistical test of association is the chi-square 
test on the difference between the cases and controls in the proportion of 
individuals having the factor under test. A corrected chi square may be 
calculated routinely as 

$$\chi^2 = \frac{\left(|AD - BC| - \tfrac{1}{2}T\right)^2 \, T}{N_1 M_1 N_2 M_2}$$

and tested as a chi square with 1 degree of freedom in the usual manner. 
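As a minimal sketch, the corrected chi square may be computed directly from the four cell counts defined above. The function names and the counts in the example are hypothetical. The probability for 1 degree of freedom is obtained from the identity that the upper tail of chi square with 1 degree of freedom equals erfc(sqrt(x/2)).

```python
from math import erfc, sqrt


def corrected_chi_square(a: int, b: int, c: int, d: int) -> float:
    """Corrected chi square for a fourfold table.

    a, b: cases with and without the factor
    c, d: controls with and without the factor
    All four marginal totals are assumed to be nonzero.
    """
    n1, n2 = a + b, c + d      # cases, controls
    m1, m2 = a + c, b + d      # with factor, free of factor
    t = n1 + n2
    return (abs(a * d - b * c) - t / 2) ** 2 * t / (n1 * m1 * n2 * m2)


def upper_tail_1df(chi_square: float) -> float:
    """Probability of exceeding the given chi square with 1 degree of freedom."""
    return erfc(sqrt(chi_square / 2))


# Hypothetical counts: 60 of 100 cases and 40 of 100 controls have the factor.
chi_sq = corrected_chi_square(60, 40, 40, 60)
p_value = upper_tail_1df(chi_sq)
```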

A suggested measure of the strength of the association of the disease 
with the factor is the apparent risk of the disease for those with the 
factor, relative to the risk for those without the factor. Consider that 
a population falls into the four possible categories and in the proportions 
indicated by the following table: 

                     With factor     Free of factor     Total 

With disease         $P_1$           $P_2$              $P_1 + P_2$ 
Free of disease      $P_3$           $P_4$              $P_3 + P_4$ 

Total                $P_1 + P_3$     $P_2 + P_4$        1 

The proportion of persons with the factor having the disease is 
$P_1/(P_1 + P_3)$, while the corresponding proportion for those free of the 
factor is $P_2/(P_2 + P_4)$. Relatively then, the risk of the disease for those 
with the factor is $P_1(P_2 + P_4)/[P_2(P_1 + P_3)]$. On a sampling basis this 
quantity may be estimated either by drawing a sample of the general 
population and estimating $P_1$, $P_2$, $P_3$, and $P_4$ therefrom or estimating 
$P_1/(P_1 + P_3)$ and $P_2/(P_2 + P_4)$ separately from samples of persons with, 
and persons free of, the factor. 

It may be noted, however, that if the relative risk as defined equals 
unity, then the quantity $P_1P_4/P_2P_3$ will also equal unity. Further, for 
diseases of low incidence where the values for $P_1$ and $P_2$ are small in 
comparison with $P_3$ and $P_4$ it follows, as has been pointed out by Cornfield 
(31), that $P_1P_4/P_2P_3$ is also a close approximation to the relative risk. 
This latter approximate relative risk can properly be estimated from 
the two sample approaches described or from samples drawn on a retro- 
spective basis; that is, separate samples of persons with, and persons 
free of, the disease. The sample proportions of persons with, and free 
of, the factor in the retrospective approach provide estimates of 
$P_1/(P_1 + P_2)$ and of $P_2/(P_1 + P_2)$ from the sample having the disease and 
of $P_3/(P_3 + P_4)$ and of $P_4/(P_3 + P_4)$ from the disease-free sample. The 
estimate of $P_1P_4/P_2P_3$ is obtained by appropriate multiplication and 
division of these four quantities. 

Whichever of the three methods of sampling is employed, the estimate 
of the approximate relative risk, $P_1P_4/P_2P_3$, reduces simply to $AD/BC$, 
where A, B, C, and D are defined in the manner stated in the first para- 
graph of this section. Also, the chi-square test of association given, 
which is essentially a test of whether or not the relative risk is unity, is 
equally applicable to all three sampling methods. 
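The closeness of the approximation for a disease of low incidence, and the reduction of the sample estimate to AD/BC, can be checked numerically. The sketch below uses hypothetical population proportions and hypothetical sample counts; the function names are introduced only for the illustration.

```python
def exact_relative_risk(p1: float, p2: float, p3: float, p4: float) -> float:
    """Risk of disease with the factor relative to the risk without it."""
    return (p1 / (p1 + p3)) / (p2 / (p2 + p4))


def approximate_relative_risk(p1: float, p2: float, p3: float, p4: float) -> float:
    """Cross-product ratio P1*P4 / (P2*P3)."""
    return (p1 * p4) / (p2 * p3)


def sample_estimate(a: int, b: int, c: int, d: int) -> float:
    """Estimate of the approximate relative risk from sample counts:
    a, b are cases with and without the factor; c, d are controls
    with and without the factor."""
    return (a * d) / (b * c)


# Hypothetical low-incidence population: 1 percent attack rate with the
# factor, 0.5 percent without it, and 20 percent of persons with the factor.
p1, p2, p3, p4 = 0.002, 0.004, 0.198, 0.796
exact = exact_relative_risk(p1, p2, p3, p4)
approx = approximate_relative_risk(p1, p2, p3, p4)

# Hypothetical retrospective sample of 100 cases and 100 controls.
estimate = sample_estimate(60, 40, 40, 60)
```

With incidence this low the two measures differ only in the second decimal place. The gap widens as the disease becomes more common.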

In the foregoing the two basic statistical tools of the epidemiologist 
for retrospective studies, the chi-square significance test and the measure 
of a relative risk, have been described for a relatively simple situation, 
one in which to all intents there is a single homogeneous population. 
The more complex situations confronting the epidemiologist in actual 
practice and the corresponding modifications in the statistical procedures 
will be presented. 

Two other statistical problems may be noted here. One is the deter- 
mination of how large a retrospective study to conduct. This depends 
on how sure we wish to be that the study will yield clear evidence that the 
relative risk is not unity, when it in fact differs from unity to some im- 
portant degree. Application of this statistical technique requires re- 
interpreting a relative risk greater than unity into the corresponding 
difference between the diseased and the disease-free groups in the propor- 
tion of persons with the factor. For example, suppose an attack rate of 
20 percent, given a normal rate of 10 percent, is worth uncovering. Sup- 
pose further that the factor associated with the increased disease rate 
affects 20 percent of the population. The population would then be 
distributed as follows: 

                     With factor    Free of factor    Total

With disease         $P_1$ = 4%     $P_2$ = 8%        12%
Free of disease      $P_3$ = 16%    $P_4$ = 72%       88%

Total                20%            80%               100%

The required retrospective study should be large enough to differentiate 
between a 33.3 percent [$P_1/(P_1 + P_2)$] relative frequency of the factor
among diseased individuals and an 18.2 percent [$P_3/(P_3 + P_4)$] relative
frequency among disease-free individuals. The usual procedures for determining
required sample sizes to differentiate between two binomial
proportions are applicable in this situation.
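The arithmetic of this example can be sketched in Python. The function names are illustrative, and the sample-size helper uses a common normal-approximation formula for comparing two proportions; the 5 percent two-sided significance level and 80 percent power are assumptions of this sketch, not values given in the text.

```python
from statistics import NormalDist


def population_table(rate_with, rate_without, prevalence):
    """Split a population into the four cells P1..P4 (as fractions)."""
    p1 = prevalence * rate_with                   # diseased, with factor
    p2 = (1 - prevalence) * rate_without          # diseased, free of factor
    p3 = prevalence * (1 - rate_with)             # disease-free, with factor
    p4 = (1 - prevalence) * (1 - rate_without)    # disease-free, free of factor
    return p1, p2, p3, p4


def n_per_group(pa, pb, alpha=0.05, power=0.80):
    """Approximate size of each sample needed to separate proportions pa and pb.

    alpha and power are assumed defaults, not taken from the paper.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (pa + pb) / 2
    spread = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
              + z_beta * (pa * (1 - pa) + pb * (1 - pb)) ** 0.5)
    return (spread / (pa - pb)) ** 2


# Attack rates of 20 and 10 percent; factor present in 20 percent of the population.
p1, p2, p3, p4 = population_table(0.20, 0.10, 0.20)
freq_cases = p1 / (p1 + p2)        # factor frequency among the diseased
freq_controls = p3 / (p3 + p4)     # factor frequency among the disease-free
approx_relative_risk = (p1 * p4) / (p2 * p3)

print(f"factor frequency: cases {freq_cases:.3f}, controls {freq_controls:.3f}")
print(f"approximate relative risk: {approx_relative_risk:.2f}")
print(f"size of each sample (assumed alpha, power): {n_per_group(freq_cases, freq_controls):.0f}")
```

The approximate relative risk printed here is the cross-product ratio, which is somewhat larger than the ratio of the two attack rates because the disease in this example is not rare.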

While rigorous extension of this procedure to the more complex situa- 
tions to be considered is not too simple, it can readily be adapted to 
secure approximations of the necessary study size. One might, for 
example, start by estimating the over-all required sample size following 
the procedure just indicated for differentiating between two sample 
proportions, assuming that cases and controls are homogeneous with 
respect to factors other than the one under investigation. Suppose on an
over-all basis it is determined that the study should include $N_1 = 200$
disease cases and $N_2 = 200$ controls, but that the study data will be subclassified
for purposes of analysis. Ignoring mathematical complications
resulting from variations in binomial parameter values within individual
subclassifications, we may interpret the above values of $N_1$ and $N_2$ as
roughly meaning that the total information required for the study is
$N_1N_2/(N_1 + N_2) = 100$. The objective should then be to assign values
to $N_{1i}$ and $N_{2i}$ to obtain a total score of 100 for the cumulated information
over all the subclassifications, $\sum N_{1i}N_{2i}/(N_{1i} + N_{2i})$, where $N_{1i}$ and $N_{2i}$
are the number of cases and controls in the ith subclassification.
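A minimal sketch of the information score follows. The two allocations of cases and controls across subclassifications are hypothetical and serve only to show how the cumulated score responds to poor cross-matching.

```python
def information(n_cases, n_controls):
    """Information score N1*N2/(N1 + N2) for one group of cases and controls."""
    total = n_cases + n_controls
    return n_cases * n_controls / total if total else 0.0


def total_information(strata):
    """Cumulated information over subclassifications given as (cases, controls)."""
    return sum(information(n1, n2) for n1, n2 in strata)


overall = information(200, 200)

# Hypothetical allocations of the same 200 cases and 200 controls.
balanced = [(50, 50), (50, 50), (50, 50), (50, 50)]
lopsided = [(90, 10), (10, 90), (50, 50), (50, 50)]

print(overall, total_information(balanced), total_information(lopsided))
```

With a constant case-control ratio the subclassified study retains the full score, while the lopsided allocation loses roughly a third of it.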

This formulation of required total information brings out some aspects 
of retrospective study planning which are considered later in this paper. 
For instance, if any $N_{1i}$ or $N_{2i}$ is zero, no information is available from
that particular category. Much of the benefit of a large $N_{1i}$ (or $N_{2i}$) in
any particular category is lost if the corresponding $N_{2i}$ (or $N_{1i}$) is small.
It is normally desirable to have $N_{1i}$ and $N_{2i}$ values commensurate with
each other; for fixed totals, $\sum N_{1i}$ and $\sum N_{2i}$, the total information in an
investigation will be at a maximum if the degree of cross-matching is equal
in all subclassifications with a constant case-control ratio of $\sum N_{1i}/\sum N_{2i}$.
Maintaining a fixed case-control ratio among categories need not preclude 
assigning more cases and controls to specific categories. Larger numbers 
may be desired for categories of crucial interest to the study or for cate- 
gories which represent greater segments of the population. 

The information formula also reveals the limits for adjusting the relative 
numbers of diseased and control cases. It shows that if the number of 
controls ($N_2$) becomes indefinitely large, the required $N_1$ value can at most
be reduced only by a factor of 2. Furthermore, this reduction in required 
diseased cases may be inappropriate if one wishes to obtain clear results 
for the separate subcategories. 
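The factor-of-2 limit follows from solving the information formula for the number of cases. A short sketch, with an illustrative function name and the target score of 100 taken from the earlier example:

```python
def cases_needed(information, n_controls):
    """Solve N1*N2/(N1 + N2) = information for N1, given N2 controls."""
    if n_controls <= information:
        raise ValueError("no number of cases can reach this score with so few controls")
    return information * n_controls / (n_controls - information)


for n_controls in (200, 400, 1_000, 1_000_000):
    print(n_controls, round(cases_needed(100, n_controls), 1))
```

However many controls are added, the required number of cases stays above 100, which is half of the 200 needed when cases and controls are equal in number.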

The study size requirements suggested by the information formula may 
be seriously in error if the binomial parameters show excessive variation 
among subcategories. Ordinary precautions, however, should serve to 
keep the formula useful. In some situations it may be desirable to modify 
the information formula indicated above to reflect the contribution due 
to variation in the binomial parameters involved. 

The second statistical procedure involves setting reasonable limits on 
the relative risk when it is in fact different from unity. For the homo- 
geneous case considered, formulas for such limits have been published in 
(46). The chi-square test as stated is essentially a test of whether or not
the confidence limits include unity. Extension of this procedure to more 
complex cases is fairly involved and depends primarily on the measure of 
relative risk adopted. In the absence of a clear justification for any single 
measure of over-all relative risk, the burden of extremely involved computation
of confidence limits in such cases would not seem warranted.
Instead, we feel that emphasis should be directed to obtaining an over-all 
measure of risk, coupled with an over-all test of statistical significance. 

Statistical Procedures for Factor Control

A major problem in any epidemiological study is the avoidance of spu- 
rious associations. It has been remarked that where the risk of disease 
changes with age, apparent association of the disease with other age- 
related factors can result. However, there are appropriate statistical 
procedures for controlling those factors known or suspected to be related 
to disease occurrence. They serve not only to remove bias from the 
investigation but, in addition, can add to its precision. 

Two simple procedures for obtaining factor control may first be men- 
tioned. One is simply to restrict the investigation to individuals homo- 
geneous on the factors to be controlled. For this situation the statistical 
procedures already outlined would be appropriate. The potential number 
of individuals available for such a study would, of course, be sharply 
restricted. 

There is also the matching case method. A sample of N diseased 
individuals is drawn and the characteristics of each individual noted with 
respect to the control factors. Subsequently, a sample of N well indi- 
viduals is drawn, with each individual matched on the control factors to 
one of the diseased individuals. The statistical procedures to be presented 
can be shown to cover the matched-sample approach as a special case, 
and a discussion of the analysis of such data will be given in that context. 
Some difficulties of the matched-sample study may be mentioned here.
One is that when matching is made on a large number of factors, not even 
the fiction of a random sampling of control individuals can be maintained. 
Instead, one must be grateful for each matching control available.
Another difficulty is that the method cannot be applied to factors under 
control, since diseased and control individuals are identical with respect 
to these factors. Conversely, factors under study in matched samples 
cannot themselves be controlled statistically. They can be analyzed 
separately or in particular conjunctions but cannot be employed as control 
factors. 

An alternative to case matching is to draw independent samples of 
cases and controls, and adjust for other factors in the analysis. This 
approach requires simply the classification of individuals according to the 
various control and study factors desired, and an analysis for each separate 
subclassification as well as an appropriate summary analysis. Its success 
will depend on a reasonable degree of cross-matching between observations 
on diseased and control persons. In a small study various devices for 
reducing the number of subclassifications and for increasing the chances of 
cross-matching may be necessary, including a limit on the number of 
factors on which individuals are classified in any one analysis and the use 
of broad categories for any particular classification. Thus, a 10-year 
interval for age classification might permit a reasonable degree of cross- 
matching, whereas a 1-month interval would not. 

The need for some degree of deliberate matching, even when the 
classification approach is employed, can be seen. If the disease under 
consideration occurs at advanced ages, little cross-matching would result 
if controls were selected from the general population. The remedy lies in
deliberately selecting controls from the same age groups anticipated for 
persons with the disease, perhaps even matching one or more controls on 
age for each diseased person. This principle can be extended to matching 
on several control factors, solely for the purpose of increasing the extent of
cross-matching in the analysis. 

One of the subtle effects which can occur in a retrospective study, even 
with careful planning, may be pointed out. It can be shown, for instance, 
that within a given age interval the average age of individuals with cancer 
of certain sites will be greater than the average age of individuals from the 
general population in the same age interval. This can arise when incidence 
increases rapidly with age and may pose a serious problem with broad age 
intervals. This effect can be offset by close matching of cases and controls 
on age in drawing samples, even though they are classified by a broad age 
category in the analysis. 

When a random sample of diseased and disease-free individuals is 
classified according to various control factors the distribution of the factor 
under study within the ith classification may be represented as follows: 

                     With factor    Free of factor    Total

With disease         $A_i$          $B_i$             $N_{1i}$
Free of disease      $C_i$          $D_i$             $N_{2i}$

Total                $M_{1i}$       $M_{2i}$          $T_i$

Within this subgroup the approximate relative risk associated with the
disease may be written as $A_iD_i/B_iC_i$. One may compare the observed
number of diseased persons having the factor, $A_i$, with its expectation
under the hypothesis of a relative risk of unity, $E(A_i) = N_{1i}M_{1i}/T_i$.
The discrepancy between $A_i$ and $E(A_i)$ (which is also the discrepancy for
any other cell within a 2 X 2 table) can be tested relative to its variance
which, subject to the fixed marginal totals ($N_{1i}$, $N_{2i}$, $M_{1i}$, and $M_{2i}$), is
given by $V(A_i) = N_{1i}N_{2i}M_{1i}M_{2i}/[T_i^2(T_i - 1)]$. The corrected chi square
with 1 degree of freedom, $(|A_i - E(A_i)| - \frac{1}{2})^2/V(A_i)$, reduces in this case to
$(|A_iD_i - B_iC_i| - \frac{1}{2}T_i)^2(T_i - 1)/(N_{1i}N_{2i}M_{1i}M_{2i})$. This formula for the variance
of $A_i$ is obtained as the variance of the binomial variable, $N_1PQ$ ($P = M_1/T$,
$Q = M_2/T$), multiplied by a finite population correction factor
$(T - N_1)/(T - 1) = N_2/(T - 1)$. The earlier chi-square formula, which is ordinarily
used, essentially employs a finite population correction factor of $N_2/T$.

There is thus a difference between the two chi-square formulas of a 
factor of (T-l)/T which, though trivial for any single significance test 
with respectably large T, can become important in the over-all signifi- 
cance test. It is with the latter formula, just presented, that chi square 
is computed as the ratio of the square of a deviation from its expected 
value to its variance. 
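The two forms of the corrected chi square for a single 2 X 2 table, and the $(T - 1)/T$ factor separating them from the usual formula, can be checked numerically. The cell counts below are illustrative only, and the functions assume that no marginal total is zero.

```python
def expectation_and_variance(a, b, c, d):
    """E(A) and V(A) for one 2 x 2 table with all marginal totals held fixed."""
    n1, n2 = a + b, c + d          # diseased, disease-free
    m1, m2 = a + c, b + d          # with factor, free of factor
    t = n1 + n2
    expected = n1 * m1 / t
    variance = n1 * n2 * m1 * m2 / (t * t * (t - 1))
    return expected, variance


def corrected_chi_square(a, b, c, d):
    """(|A - E(A)| - 1/2)^2 / V(A)."""
    expected, variance = expectation_and_variance(a, b, c, d)
    return (abs(a - expected) - 0.5) ** 2 / variance


def corrected_chi_square_reduced(a, b, c, d):
    """Reduced form: (|AD - BC| - T/2)^2 (T - 1) / (N1 N2 M1 M2)."""
    n1, n2, m1, m2 = a + b, c + d, a + c, b + d
    t = n1 + n2
    return (abs(a * d - b * c) - t / 2) ** 2 * (t - 1) / (n1 * n2 * m1 * m2)


def usual_chi_square(a, b, c, d):
    """The ordinary corrected chi square, which uses T where the above uses T - 1."""
    n1, n2, m1, m2 = a + b, c + d, a + c, b + d
    t = n1 + n2
    return (abs(a * d - b * c) - t / 2) ** 2 * t / (n1 * n2 * m1 * m2)


cells = (12, 8, 5, 15)             # illustrative A, B, C, D
t_total = sum(cells)
print(corrected_chi_square(*cells), corrected_chi_square_reduced(*cells))
print(corrected_chi_square(*cells) / usual_chi_square(*cells), (t_total - 1) / t_total)
```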

The adjustment for control factors is at this point resolved for the result- 
ing separate subclassifications. The problem of over-all measures of 
relative risk and statistical significance still remains. A reasonable over-all 
significance test which has power for alternative hypotheses, where there
is a consistent association in the same direction over the various subclassifications
between the disease and a study factor, is provided by
relating the summation of the discrepancy between observation and
expectation to its variance. The corrected chi square with 1 degree of
freedom then becomes $(|\sum A_i - \sum E(A_i)| - \frac{1}{2})^2/\sum V(A_i)$, where $E(A_i)$ and
$V(A_i)$ are defined as above.
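The over-all test can be sketched as follows. The twelve sets of cell counts are an assumption of this sketch: they were back-calculated from the derivative columns of table 1, presented later in this paper, and are not printed as such in this section. They do reproduce the table's column totals and the corrected chi-square value of 30.66 quoted in the text.

```python
# (A, B, C, D) for each age-occupation subclassification.
# ASSUMPTION: counts inferred from the derivative columns of table 1.
STRATA = [
    (0, 2, 0, 7), (2, 5, 1, 24), (3, 6, 0, 49), (0, 11, 0, 42),
    (3, 0, 2, 6), (2, 2, 2, 18), (2, 4, 2, 23), (0, 6, 1, 11),
    (1, 0, 3, 10), (4, 1, 1, 12), (0, 6, 1, 19), (1, 3, 0, 15),
]


def summary_chi_square(strata):
    """Corrected chi square (1 d.f.): (|sum A - sum E(A)| - 1/2)^2 / sum V(A)."""
    sum_a = sum_e = sum_v = 0.0
    for a, b, c, d in strata:
        n1, n2, m1, m2 = a + b, c + d, a + c, b + d
        t = n1 + n2
        if t < 2:                  # a single observation carries no information
            continue
        sum_a += a
        sum_e += n1 * m1 / t
        sum_v += n1 * n2 * m1 * m2 / (t * t * (t - 1))
    return (abs(sum_a - sum_e) - 0.5) ** 2 / sum_v


chi_square = summary_chi_square(STRATA)
print(f"summary chi square: {chi_square:.2f}")
```

Subclassifications in which nobody, or everybody, has the factor contribute zero to both the discrepancy and the variance, so they drop out of the test automatically.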

The specification of a summary estimate of the relative risk associated 
with a factor is not so readily resolved as that for an over-all significance 
test, and involves consideration of alternate approaches to a weighted 
average of the approximate relative risks for each subclassification 
($A_iD_i/B_iC_i$). If one could assume that the increased relative risk associated
with a factor was constant over all subclassifications, the estimation
problem would reduce to weighting the several subclassification estimates 
according to their respective precisions. The complex maximum likeli- 
hood iterative procedure necessary for obtaining such a weighted estimate 
would seem to be unjustified, since the assumption of a constant relative 
risk can be discarded as usually untenable. 

Another possible criterion for obtaining a summary estimate of relative 
risk would involve weighting the risks for subclassification by "impor- 
tance." A twofold increase of a large risk is more important than a 
twofold increase of a small risk. An increased risk for a large group is 
more important than one for a small group. An increased risk for young 
individuals may be more important than for older individuals with a 
shorter life expectation. Difficulties arise in attempts to weight relative 
risk by measures of importance. For one, the necessary information on 
importance, in terms of the size of the populations affected or in terms 
of the absolute level of rates prevailing in the subgroups, is generally not 
contained within the scope of the investigation. A problem in definition 
of the precise terms of the weighted comparison also appears. Does 
one want to adjust the risks of disease among persons with the factor to 
the distribution of the population without the factor, or vice versa, or 
adjust the risks for the populations with and without the factor to a 
combined standard population? These procedures, and the different 
phrasing of the comparisons which they entail, could yield different 
answers. If only a small proportion of the population with the factor was 
in a subcategory with a high relative risk, while most of the factor-free 
population fell into this subcategory, and in other categories the relative 
risk associated with the factor was less than unity, the factor would appear 
to exert a protective influence under one set of weights but a harmful 
effect under the other. 

Published instances of summary relative risks do not fall clearly into 
either of the two categories, weighting by precision or weighting by
importance. They do follow an approach usually employed in age-adjusting
mortality data. Since the relative risk for a single 2 X 2 table can be
obtained from the incidence of the factor among diseased and well indi- 
viduals, the problem would appear translatable into terms of obtaining 
over-all, category-adjusted incidence figures. Direct or indirect methods
of adjustment can be used, employing as a standard of reference the fre- 
quency distribution or rates corresponding to the sample of diseased 
persons, of controls, or the diseased persons and controls combined. 

While such adjustment procedures provide weighting by importance 
in their customary application to mortality rates, this is not so in the 
relative risk situation. This may be illustrated in the following extreme 
example. Suppose that in each of two subcategories the approximate 
relative risk for a contrast between the presence and absence of a factor 
is about 5, which arises in the first subcategory from contrasting per- 
centages of 1 and 5, and in the second subcategory from contrasting per- 
centages of 95 and 99. If these percentages were based on equal numbers 
of individuals, all methods of category adjusting would yield contrasting 
adjusted summary percentages of 48 and 52, and a resultant relative risk
of slightly less than 1.2. Some other approach for obtaining category-adjusted
relative risks would seem desirable. However, to the extent
that such extreme situations are not encountered in actual practice, results 
based on these more conventional adjustment procedures will not be 
grossly in error. 

A suggested compromise formula for over-all relative risk is given by
$R = \sum(A_iD_i/T_i)/\sum(B_iC_i/T_i)$. As a weighted average of relative risks
this formula would, in the illustration given, yield the over-all relative
risk of 5 found in each of the two subcategories. The weights are of the
order $N_{1i}N_{2i}/(N_{1i} + N_{2i})$ and as such can be considered to weight approximately
according to the precision of the relative risks for each subcategory.
The weights can also be regarded as providing a reasonable weighting
by importance.
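The extreme illustration can be worked through numerically. This sketch assumes 100 diseased and 100 disease-free persons in each subcategory and reads the contrasting percentages as the frequency of the factor among the disease-free (1 and 95 percent) and among the diseased (5 and 99 percent); both choices are assumptions made for illustration.

```python
def cross_product_ratio(a, b, c, d):
    """Approximate relative risk AD/BC for one 2 x 2 table."""
    return (a * d) / (b * c)


def summary_r(strata):
    """R = sum(A*D/T) / sum(B*C/T) over the subclassifications."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator


# (A, B, C, D): diseased with/without factor, disease-free with/without factor.
strata = [(5, 95, 1, 99), (99, 1, 95, 5)]
pooled = tuple(sum(column) for column in zip(*strata))

adjusted_pct_cases = 100 * pooled[0] / (pooled[0] + pooled[1])
adjusted_pct_controls = 100 * pooled[2] / (pooled[2] + pooled[3])

print([round(cross_product_ratio(*s), 2) for s in strata])
print(adjusted_pct_controls, adjusted_pct_cases, round(cross_product_ratio(*pooled), 2))
print(round(summary_r(strata), 2))
```

With equal numbers in every subcategory, adjusting the percentages amounts to pooling, which collapses the contrast, whereas the weighted formula recovers the risk found within each subcategory.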

An interesting property of this summary relative risk formula is that it
equals unity only when $\sum A_i = \sum E(A_i)$ and hence the corresponding
chi square is zero. From the fact that $A_i - E(A_i) = (A_iD_i - B_iC_i)/T_i$,
it follows that when $\sum A_i = \sum E(A_i)$, $\sum A_iD_i/T_i$ will equal $\sum B_iC_i/T_i$, chi
square will be zero, and $R$ will be unity. The chi-square significance test
can thus be construed as a significance test of the departure of $R$ from unity.

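Of some other procedures for measuring over-all relative risks, the one
following also has the interesting property of being equal to unity when
$\sum A_i = \sum E(A_i)$ and therefore subject to the chi-square test:

$R_1 = \frac{\sum A_i \sum D_i / \sum B_i \sum C_i}{\sum E(A_i) \sum E(D_i) / \sum E(B_i) \sum E(C_i)}$,

where $E(A_i) = N_{1i}M_{1i}/T_i$, $E(B_i) = N_{1i}M_{2i}/T_i$, $E(C_i) = N_{2i}M_{1i}/T_i$, and $E(D_i) = N_{2i}M_{2i}/T_i$.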

In this formula the numerator represents the crude value for the relative 
risk, which would result from pooling the data into one table and ignoring 
all subclassification on other factors. The denominator represents the 
crude value for relative risk, which would have resulted from pooling in 
the situation where all relative risks within each subclassification were 
exactly unity. Readers familiar with the "indirect" method of computing
standardized mortality ratios will recognize an analogy between
the "indirect" method and the above procedure.
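A sketch of the two estimators side by side follows. The cell counts are an assumption: they were back-calculated from the derivative columns of table 1 and are not printed as such in this section. They reproduce the crude relative risk of 7.10, the adjustment factor of 1.0081, and the $R_1$ of 7.05 shown with that table.

```python
# (A, B, C, D) per subclassification.
# ASSUMPTION: counts inferred from the derivative columns of table 1.
STRATA = [
    (0, 2, 0, 7), (2, 5, 1, 24), (3, 6, 0, 49), (0, 11, 0, 42),
    (3, 0, 2, 6), (2, 2, 2, 18), (2, 4, 2, 23), (0, 6, 1, 11),
    (1, 0, 3, 10), (4, 1, 1, 12), (0, 6, 1, 19), (1, 3, 0, 15),
]


def expected_cells(a, b, c, d):
    """E(A), E(B), E(C), E(D) under a relative risk of unity."""
    n1, n2, m1, m2 = a + b, c + d, a + c, b + d
    t = n1 + n2
    return n1 * m1 / t, n1 * m2 / t, n2 * m1 / t, n2 * m2 / t


def estimator_r(strata):
    """R = sum(A*D/T) / sum(B*C/T)."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator


def estimator_r1(strata):
    """Return (crude relative risk, adjustment factor, R1 = crude / factor)."""
    sa, sb, sc, sd = (sum(col) for col in zip(*strata))
    ea, eb, ec, ed = (sum(col) for col in zip(*(expected_cells(*s) for s in strata)))
    crude = sa * sd / (sb * sc)
    adjustment = ea * ed / (eb * ec)
    return crude, adjustment, crude / adjustment


r = estimator_r(STRATA)
crude, adjustment, r1 = estimator_r1(STRATA)
print(f"R = {r:.2f}, crude = {crude:.2f}, factor = {adjustment:.4f}, R1 = {r1:.2f}")
```

The adjustment factor is close to unity here, so $R_1$ barely moves from the crude pooled value while $R$ is considerably larger.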

The estimator $R_1$ can be seen to have a bias toward unity. One reason
is covered by the illustration which indicated that adjusted percentages 
(or frequencies) do not yield an appropriate adjusted relative risk. In 
addition, when either cases or controls have little representation in a 
subcategory, there will be lack of cross-matching and little information 
about relative risk, and the observed cell frequencies and their expecta- 
tions will be numerically close. Such results will, in the process of sum- 
mation used by the estimator, tend to force its value toward unity. This 
weakness will not be too important if the degree of cross-matching is 
roughly equal in the various subclassifications — an optimum goal one 
would normally attempt to achieve. The bias will become more pro- 
nounced as the number of control factors increases and as the prospects 
for good cross-matching become poorer. 

We used the estimator $R_1$ in a recent paper (87), knowing its potential
weaknesses. This was done to present results more nearly comparable
with those reported by other investigators using similarly biased estimators.
One set of results from this paper on lung cancer among women
illustrates the conservative behavior of estimator $R_1$ compared with $R$ as
additional factors are controlled. The relative risk ($R_1$) for epidermoid
and undifferentiated pulmonary carcinoma associated with smoking more
than one pack of cigarettes daily as compared to nonsmokers decreased
from 7.1 (controlled for age) to 5.6 (controlled for age and coffee consumption).
The corresponding figures, with $R$ as a measure of relative risk,
were 9.7 and 9.9.

Computational procedures for $R$ and $R_1$ are presented in table 1, drawing
on material comparing smoking histories of women diagnosed as cases of
epidermoid and undifferentiated pulmonary carcinoma with those of female
controls. For simplicity in presentation only two smoking levels are considered:
nonsmokers and smokers of more than one pack of cigarettes
daily. An extension of the significance testing procedures to the case of
study factors at more than two levels is discussed later. The control
factors are age and occupation. The basic data are given in the first 9
columns. Columns 10 and 11 carry the derivative calculations required
for $R$. Columns 12 and 13 are used in the computation for $R_1$ and for
the variance estimate in column 14, the latter being needed for the chi-square
test. Only columns 1 to 10, 12, and 14 would be necessary to
compute chi square, $R$ and $R_1$. Column 13 is not essential for the computation
of E(D) but simplifies computation of V(A), while providing a
check on E(A). Column 11 serves as a check on 10 and 12. A system
of checks and computations is outlined at the bottom of table 1. Not all
the computations shown would ordinarily be necessary for an analysis.

Table 1. Derivative computations (columns 10 to 18)

Column definitions: (10) = (1)(5)/(9); (11) = (2)(4)/(9); (12) = (3)(7)/(9);
(13) = (6)(8)/(9); (14) = (12)(13)/[(9) - 1.0]; (15) = (3)(4)/(6);
(16) = (3)(5)/(6); (17) = (1)(6)/(3); (18) = (2)(6)/(3).

          (10)      (11)      (12)      (13)      (14)      (15)      (16)      (17)      (18)
          AD/T      BC/T      E(A)      E(D)      V(A)    N1C/N2    N1D/N2    N2A/N1    N2B/N1

             0         0         0     7.000         0         0     2.000         0     7.000
         1.500     0.156     0.656    22.656     0.480     0.280     6.720     7.143    17.857
         2.534         0     0.466    46.466     0.380         0     9.000    16.333    32.667
             0         0         0    42.000         0         0    11.000         0    42.000

         1.636         0     1.364     4.364     0.595     0.750     2.250     8.000         0
         1.500     0.167     0.667    16.667     0.483     0.400     3.600    10.000    10.000
         1.484     0.258     0.774    21.774     0.562     0.480     5.520     8.333    16.667
             0     0.333     0.333    11.333     0.222     0.500     5.500         0    12.000

         0.714         0     0.286     9.286     0.204     0.231     0.769    13.000         0
         2.667     0.056     1.389     9.389     0.767     0.385     4.615    10.400     2.600
             0     0.231     0.231    19.231     0.178     0.300     5.700         0    20.000
         0.790         0     0.211    14.211     0.166         0     4.000     3.750    11.250

Total   12.825     1.201     6.375   224.375     4.036     3.325    60.675    76.960   172.040

Checks: Total discrepancy, Y = ΣA - ΣE(A) = Σ(AD/T) - Σ(BC/T) = Σ(10) - Σ(11) = 11.625
        Σ(15) + Σ(16) = 64.000; Σ(3) = 64
        Σ(17) + Σ(18) = 249.000; Σ(6) = 249

Derivative computations:
        ΣE(B) = Σ(2) + Y = 57.625
        ΣE(C) = Σ(4) + Y = 24.625
        Σ(AT/N1) = Σ(1) + Σ(17) = 94.960
        Σ(BT/N1) = Σ(2) + Σ(18) = 218.040
        Σ(CT/N2) = Σ(4) + Σ(15) = 16.325
        Σ(DT/N2) = Σ(5) + Σ(16) = 296.675

Adjustment factor, f = 1.0081; $R_1$ = r/f = 7.05; $R_2$ = 7.14; $R_3$ = 8.12

Figures shown are rounded from those actually calculated and consequently are
not fully consistent. Column totals and figures shown do not necessarily agree.

The corrected chi-square value of 30.66 (1 degree of freedom) would
indicate a highly significant association between epidermoid and undifferentiated
pulmonary carcinoma and cigarette smoking in women, after
adjusting for possible effects connected with age or occupation. The
value of $R$ implies that the risk of these cancers is 10.7 times as great for
women currently smoking more than 1 pack a day than for women who
[illegible] identical with the crude relative risk, 7.10, which would result
from pooling the data with no [illegible] for women currently smoking
1 pack a day or less and for occasional or discontinued [illegible].
The computation for three other estimates of risk is
also outlined in table 1. The derivative computations required
for this purpose appear in [illegible]. All three estimates are
[illegible]; that is, the use of a
standard distribution to which case and control distributions are
adjusted. If the distribution of diseased cases is taken as the standard
distribution to which the controls are adjusted, the estimator becomes

$R_2 = \frac{\sum A_i \sum (N_{1i}D_i/N_{2i})}{\sum B_i \sum (N_{1i}C_i/N_{2i})}$

Estimator $R_2$ was used by Wynder et al. in a study of the association of
cervical cancer in women with circumcision status of sex partners (16).
The merit of employing the cervical cancer case-distribution as the standard
presumably rests on the fact that this distribution at least would be
well defined by the study.

If the distribution of control cases is taken as standard the estimator
becomes

$R_3 = \frac{\sum (N_{2i}A_i/N_{1i}) \sum D_i}{\sum (N_{2i}B_i/N_{1i}) \sum C_i}$

If the combined distribution is taken as standard the estimator becomes

$R_4 = \frac{\sum (A_iT_i/N_{1i}) \sum (D_iT_i/N_{2i})}{\sum (B_iT_i/N_{1i}) \sum (C_iT_i/N_{2i})}$


If any $N_{1i}$ or $N_{2i}$ should equal zero, the estimator $R_4$ would not be
defined, $R_2$ is not defined for any zero-valued $N_{2i}$, and $R_3$ is not defined for
any zero-valued $N_{1i}$. In these instances it would be necessary to exclude
the zero-frequency categories to define the estimators. The estimator $R_1$
retains these categories at the expense of greater bias toward unity. The
estimator $R$ gives such categories zero weight, since they contain no
information about relative risk. The chi-square significance test gives
no weight to these categories.

While $R_4$ is clearly a direct adjusted estimate of relative risk employing
the combined distribution as standard, $R_2$ and $R_3$ may be viewed alternatively
as either direct or indirect adjusted estimates. The same estimates
will result if a direct adjustment is made using the distribution of
cases as standard, or an indirect adjustment is made using the factor 
incidence rates for controls as the standard rates. 

It may be noted that in the example used, the values for $R_2$, $R_3$, and $R_4$
(7.14, 8.12, and 7.91, respectively) were roughly comparable to $R_1$, and
all were smaller than $R$. The example was selected because all the $N_{1i}$
and $N_{2i}$ values were non-zero, so that the values of $R_2$, $R_3$, and $R_4$ were
all defined.
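The three standardized estimators can be sketched as follows. The formulas in the code are the ones implied by the derivative columns and check sums of table 1, and the cell counts are back-calculated from those columns; both are assumptions of this sketch, although they do reproduce the quoted values of 7.14, 8.12, and 7.91.

```python
# (A, B, C, D) per subclassification.
# ASSUMPTION: counts inferred from the derivative columns of table 1.
STRATA = [
    (0, 2, 0, 7), (2, 5, 1, 24), (3, 6, 0, 49), (0, 11, 0, 42),
    (3, 0, 2, 6), (2, 2, 2, 18), (2, 4, 2, 23), (0, 6, 1, 11),
    (1, 0, 3, 10), (4, 1, 1, 12), (0, 6, 1, 19), (1, 3, 0, 15),
]


def standardized_estimators(strata):
    """Return (R2, R3, R4): case, control, and combined distribution as standard."""
    sum_a = sum_b = sum_c = sum_d = 0.0
    c_scaled = d_scaled = 0.0      # controls rescaled to case totals: N1*C/N2, N1*D/N2
    a_scaled = b_scaled = 0.0      # cases rescaled to control totals: N2*A/N1, N2*B/N1
    for a, b, c, d in strata:
        n1, n2 = a + b, c + d
        if n1 == 0 or n2 == 0:
            raise ValueError("estimators undefined when a category lacks cases or controls")
        sum_a, sum_b, sum_c, sum_d = sum_a + a, sum_b + b, sum_c + c, sum_d + d
        c_scaled += n1 * c / n2
        d_scaled += n1 * d / n2
        a_scaled += n2 * a / n1
        b_scaled += n2 * b / n1
    r2 = sum_a * d_scaled / (sum_b * c_scaled)
    r3 = a_scaled * sum_d / (b_scaled * sum_c)
    # A*T/N1 = A + N2*A/N1, and similarly for the other three cells.
    r4 = ((sum_a + a_scaled) * (sum_d + d_scaled)
          / ((sum_b + b_scaled) * (sum_c + c_scaled)))
    return r2, r3, r4


r2, r3, r4 = standardized_estimators(STRATA)
print(f"R2 = {r2:.2f}, R3 = {r3:.2f}, R4 = {r4:.2f}")
```

The function raises an error rather than silently dropping a category with no cases or no controls, which leaves the decision to exclude such categories with the analyst.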

The over-all relative risk estimates are averages and as averages may 
conceal substantial variation in the magnitudes of the relative risk among 
subgroups. Ordinarily, the individual subcategory data should be ex- 
amined, paying special attention to relative risks based on reasonably 
large sample sizes. This will provide protection against the potential 
deficiencies of any particular summary relative risk formula employed. 
The over-all chi-square significance test in any case will remain appropriate 
for detecting any strong general tendency for the risk of disease to be 
associated with the presence or absence of the test factor. 

The Matched-Sample Study 

The matched-sample study previously described can be considered a 
special case of the classification procedure with the number of classi- 
fications equal to the number of pairs of individuals. The status of pairs 
of well and diseased individuals classified with respect to the presence or 
absence of the suspect factor in each individual will be represented as 
F, G, H, or J in the following fourfold table. The meanings attached to
the marginal totals A, B, C, and D are the same as those in the first
schematic representation.

                          Diseased individuals

Well individuals     With factor    Free of factor    Total

With factor          F              G                 C
Free of factor       H              J                 D

Total                A              B                 N

In the absence of association between the disease and the factor, we
expect the same number of individuals with the factor to appear among
both diseased and well individuals; that is, we expect A (= F + H) to
equal C (= F + G). This can occur only when G = H and the statistical
test is simply whether or not G differs significantly from 50 percent of
G + H. G is tested as a binomial variable with parameter 1/2, G + H
being the number of cases. G thus has expectation $\frac{1}{2}(G + H)$, variance
$\frac{1}{4}(G + H)$, and the corrected chi square with 1 degree of freedom can
readily be shown to reduce to $(|G - H| - 1)^2/(G + H)$.

Treating the data as consisting of N classifications each with $N_{1i} = N_{2i} = 1$,
$T_i = 2$ and applying the previously described procedures will
lead to the same value of chi square. For F of the N classifications,
$A_i = 1$, $M_{1i} = 2$, $M_{2i} = 0$, $E(A_i) = 1$, $V(A_i) = 0$; for G classifications
$A_i = 0$, $M_{1i} = M_{2i} = 1$, $E(A_i) = \frac{1}{2}$, $V(A_i) = \frac{1}{4}$; for H classifications
$A_i = 1$, $M_{1i} = M_{2i} = 1$, $E(A_i) = \frac{1}{2}$, $V(A_i) = \frac{1}{4}$; and for J classifications,
$A_i = 0$, $M_{1i} = 0$, $M_{2i} = 2$, $E(A_i) = 0$, $V(A_i) = 0$. Thus, $\sum A_i = F + H$,
$\sum E(A_i) = F + \frac{1}{2}(G + H)$, $\sum V(A_i) = \frac{1}{4}(G + H)$, and the resultant corrected
chi square can again be seen to be $(|G - H| - 1)^2/(G + H)$.
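The equivalence can be checked by treating each matched pair as its own 2 X 2 table. The pair counts used here are hypothetical.

```python
def summary_chi_square(strata):
    """(|sum A - sum E(A)| - 1/2)^2 / sum V(A) over tables given as (A, B, C, D)."""
    sum_a = sum_e = sum_v = 0.0
    for a, b, c, d in strata:
        n1, n2, m1, m2 = a + b, c + d, a + c, b + d
        t = n1 + n2
        sum_a += a
        sum_e += n1 * m1 / t
        sum_v += n1 * n2 * m1 * m2 / (t * t * (t - 1))
    return (abs(sum_a - sum_e) - 0.5) ** 2 / sum_v


def pairs_as_strata(f, g, h, j):
    """One table per pair: (A, B) describe the diseased member, (C, D) the well one."""
    return ([(1, 0, 1, 0)] * f      # both members have the factor
            + [(0, 1, 1, 0)] * g    # only the well member has the factor
            + [(1, 0, 0, 1)] * h    # only the diseased member has the factor
            + [(0, 1, 0, 1)] * j)   # neither member has the factor


def paired_chi_square(g, h):
    """Short form for matched pairs: (|G - H| - 1)^2 / (G + H)."""
    return (abs(g - h) - 1) ** 2 / (g + h)


F, G, H, J = 10, 5, 15, 20          # hypothetical pair counts
strata = pairs_as_strata(F, G, H, J)
stratified = summary_chi_square(strata)
short_form = paired_chi_square(G, H)
summary_r = (sum(a * d / 2 for a, b, c, d in strata)
             / sum(b * c / 2 for a, b, c, d in strata))
print(stratified, short_form, summary_r)
```

Pairs in which both members agree on the factor add nothing to the discrepancy or the variance, so only the G and H pairs drive both the test and the estimate.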

It is of interest to observe that the summary chi-square formula is 
appropriate in the matched-sample case, even though the frequencies for 
each of the separate subclassifications are small. Its appropriateness, 
despite the small frequencies, stems from the fact that it is a test on a 
summation of random variables, $A_i$, and thus tends to approach normality
rapidly, making the chi-square test valid, even though the individual
$A_i$'s are not normally distributed. This property of the chi-square
formula applies in the general classification as well as the matched-sample 
situation. Only substantial lack of cross-matching in the general case 
would tend to make the chi-square test invalid. It is also essential, of 
course, that there be some appreciable variation in the presence or absence 
of the factor under study. 

It should be noted that in the matched-sample study with $T_i = 2$ for
each of the N pairs of individuals, the variances of the $A_i$'s would have
been understated by a factor of 2, had $T - 1$ been replaced by $T$ in the
variance formulas. The usual formula for chi square does essentially
make this replacement, but it is usually of little consequence if $T$ is of
any reasonable magnitude. The formulas for relative risk in the matched-sample
study reduce simply to the following: $R = H/G$; $R_1 = R_2 = R_3 = R_4 = AD/BC$.

Study Factors at More Than Two Levels

The preceding discussion on the analysis of retrospective data has been
in terms of the test factor under study taking only two values. This
framework has sufficed for discussion of the underlying statistical ideas 
and issues. In practice, the study factor will frequently take on more 
than two, perhaps many, potential values. When the number of study 
factor values is large, grouping can reduce them to manageable 

The need to consider only a limited number of classes for the study
factor stems from the fact that, when an association is anticipated, most
of the significant information about the association will come from the
results for the more extreme values of the study factor. While it is
efficient to concentrate attention on the test factor classes expected to
show the greatest differences in association with the disease, it is also
profitable to consider intermediate values for the test factor to seek
evidence for a consistent pattern of association. For example, in table 1,
a highly significant difference between nonsmokers and women currently
smoking more than 1 pack of cigarettes daily was illustrated. Inclusion
of data for smokers of 1 pack or less a day showing results intermediate
between the other classes would have added little, if anything, to the
statistical significance of the results, and might actually lower it, if one
made an over-all test of the differences among the three smoking classes.
However, the observation that the intermediate smoking class does, in
fact, show an intermediate relative risk contributes to an orderly pattern
and increases our confidence in the conclusions suggested by the data for
the remaining two classes.

For any two particular test-factor levels, the relative risk for one over 
the other may be calculated using only the data pertaining to those two 
levels or by using the results for all test levels. In the formulas previ- 
ously given for ft ft, ft, R„ and ft, the difference between the two 
calculating procedures is simply one of setting the values of N lt , Nu, and 
T = N + N it m terms of number of cases and controls occurring at 
the two study-factor levels only, or defining them in terms of total number 
of cases and controls in the entire study. When total cases and controls 
are used in denning N u , N 2i , and T„ it can be shown that for ft, ft, ft, 
and ft the various relative risks will be internally consistent with each 
other. If the relative risk for the first level is twice that for the second 
level which in turn is twice that for the third level, then the relative risk 
for the first level will be four times that of the third. These exact rela- 
tionships do not hold for R as an estimator of relative risk, and a somewhat 
sophisticated extension of the formula for R would be required to secure 

this property. 

The problem of obtaining a summary chi square when the study factor 
is at more than two levels is complicated by the fact that the deviations 
from expectation at the various study-factor levels are intercorrelated. 
When there are but two levels, the two deviations will have perfect nega- 
tive correlation, and attention need be directed to only one of the devia- 
tions. Irrespective of the number of levels, at any one level the deviation 
from expectation among diseased persons will be equal, but opposite in 
sign, to the deviation from expectation among controls, so that attention 
can be confined to the deviations for diseased persons. 

The problem can be stated as one of reducing a set of correlated devia- 
tions into a summary chi square. Table 2 applies this process for obtain- 
ing a summary chi square to the study of the association of epidermoid 
and undifferentiated pulmonary carcinoma in women and maximum 
cigarette-smoking rate, classified into three levels, after adjustment for 

age and occupation. 

The general expressions for the expectations and variances of the 
number of cases at a particular test-factor level are given in the lower 
right section of table 2. Also shown is the expression for the covariance 
between the number of cases at two different test-factor levels. Since 
the total of all the deviations is zero, one would in general need the vari- 
ances of, and covariances between, the number of cases at all but one of 
the levels. The number of covariance terms will rise sharply as the 
number of test levels is increased. At 3 test levels there are 2 variance 
terms and 1 covariance term, while at 10 test levels, there would be 9 
variances and 36 covariance terms of interest. 
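
The counts quoted above follow from dropping one redundant level and pairing up the rest. A minimal sketch, with a hypothetical helper name `n_terms` that is not part of the source:

```python
from math import comb

def n_terms(levels):
    """Return (variances, covariances) needed for a given number of test levels."""
    free = levels - 1            # deviations sum to zero, so one level is redundant
    return free, comb(free, 2)   # one covariance per pair of retained levels

print(n_terms(3))   # (2, 1)
print(n_terms(10))  # (9, 36)
```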

For the general case the burden of computation could be heavy. After 
all the necessary computation for the deviations, their variances and 
covariances, there would still remain the problem of converting these, 
presumably by matrix methods, into a summary chi square. Since the 
retrospective problem will normally involve only a limited number of 
test-factor levels, precise procedures will be given only for the three-level 
situation, and approximate procedures outlined for the general case. 

The exact computation procedure for the three-level case is detailed 
in table 2. Lines (1), (2), and (4) show the total observed and expected 
frequencies and variances of the number of cases (and controls) at each 
of the three smoking-rate levels, after adjusting for age and occupation. 
These are the summary totals over each subclassification obtained by 
application of the formulas appearing in table 2. 

Lines (5) and (6) give the chi squares corresponding to the total devia- 
tion from expectation at each of the smoking-rate levels. The chi squares 
in line (5) are corrected for continuity. They relate to the difference 
of the particular level to which they apply, from the two other levels 
combined. Following the usual practice of making no continuity cor- 
rections when chi squares with more than 1 degree of freedom are under 
consideration, line (6) shows the uncorrected chi squares. 

The computing procedure of table 2 takes advantage of the fact that, 
since the sum of the deviations from expectation is zero, the variance 
of the third deviation must equal the sum of the other two variances 
plus twice the covariance for the first two deviations. The covariance 
of the first two deviations is readily obtained as illustrated and is used 
in calculating the summary chi square. The summary chi square is 
obtained as the sum of squares of two orthogonal deviates, with each 
square adjusted for its own variance. The first deviate squared is simply 
the uncorrected chi square at the first level in line (6)— the variance of 
the deviate remaining as initially calculated. The second deviate is the 
deviation at the second level adjusted for its correlation with the first 
deviation [adjusted $Y_2 = Y_2 - b_{21} Y_1$; $b_{21}$ = covariance $(Y_1, Y_2)$/variance 
$(Y_1)$]. The variance of the adjusted second deviate is the initial value 
reduced by that portion of the variation accounted for by the first devi- 
ation [Var. (adjusted $Y_2$) = variance $(Y_2)$ - covariance$^2$ $(Y_1, Y_2)$/variance 
$(Y_1)$]. 
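
The two-deviate construction described above can be sketched as follows. The function names and the numerical inputs are hypothetical (they are not the table 2 values); the second function is the equivalent matrix form, included only as a cross-check.

```python
from math import isclose

def summary_chi_square(y1, y2, var1, var2, cov12):
    """Sum of squares of two orthogonal deviates, each divided by its variance."""
    b21 = cov12 / var1                     # regression of second deviation on first
    y2_adj = y2 - b21 * y1                 # second deviate, adjusted for the first
    var2_adj = var2 - cov12 ** 2 / var1    # variance left after that adjustment
    return y1 ** 2 / var1 + y2_adj ** 2 / var2_adj

def quadratic_form(y1, y2, var1, var2, cov12):
    """Same quantity written as Y' S^-1 Y for the 2 x 2 covariance matrix S."""
    det = var1 * var2 - cov12 ** 2
    return (var2 * y1 ** 2 - 2 * cov12 * y1 * y2 + var1 * y2 ** 2) / det

# Hypothetical deviations, variances, and covariance for illustration only.
args = (10.0, 2.0, 20.0, 15.0, -8.0)
print(round(summary_chi_square(*args), 4))
print(isclose(summary_chi_square(*args), quadratic_form(*args)))
```

The agreement of the two functions illustrates why the stepwise adjustment avoids an explicit matrix inversion in the three-level case.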

In the present instance the summary chi square with 2 degrees of 
freedom is 28.43 [line (11)]. This presumably is close to the chi square 
with 1 degree of freedom which would have obtained had only the two 
most extreme smoking classes been compared. If one examines the 
individual uncorrected chi squares [line (6)], their total is found to be 
45.55, the maximum individual figure being 23.42. It will necessarily be 
true that the summary chi-square value will lie between the largest of the three 
chi squares and their total. At almost any reasonable probability level these 
limits would be sufficient to establish statistical significance without further 
calculation. In our companion paper (27) this rule sufficed in almost all 
instances to separate the significant from the nonsignificant results. 

Comments on Extensions to More Than Three Factors 

Two procedures can be suggested for getting approximate summary 
chi squares, when there are a large number of levels for the test factors, 
without the burden of computation that the exact method would entail. 
Both methods calculate the approximate summary chi square as a sum of 
squares of approximately orthogonal standardized deviates. 

In the first method one computes an uncorrected chi square with 1 degree 
of freedom for the difference of the first level from all the remaining levels 
combined (the same first step as in the illustration for the three-level case). 
Discarding the data from the first level, a second chi square is computed 
for the difference between the second test-factor level and the remaining 
levels combined. This is done successively up to and including the last 
two remaining levels. The approximate summary chi square is then the 
sum of the separate chi squares with the number of degrees of freedom 
being one less than the number of test levels. 

Exactly orthogonal standardized deviates would be obtained if, in the 
summary analysis, as each successive total deviation from expectation 
were evaluated, it was adjusted for its multiple regression on the preceding 
deviations, and then standardized by the adjusted variance. This, of 
course, would no longer be a simplified approximate procedure. However, 
it can be shown that for a single classification, in the multiple regression of 
any deviation from expectation on any subset of deviations, the regression 
coefficients will all be equal; the multiple regression on the set of deviations 
will be the same as the simple regression on their sum. The equality of 
regression coefficients, while holding true exactly for deviations in the 
separate subclassifications, will hold only approximately for the total 
deviations from expectation (it would hold exactly if equal numbers of 
individuals were observed from level to level at each subclassification). 
Nevertheless, this result suggests that approximately orthogonal deviates 
would be obtained if, in evaluating each successive total deviation, it were 
adjusted for the cumulative total of deviations already evaluated. Com- 
puting procedures to accomplish this can readily be devised. 

Both approximate chi-square procedures just outlined, which may have 
merit when more than three groups are being compared simultaneously, 
should, in theory, yield linear combinations of independent chi squares. 
While testing the chi-square values obtained as though they were exact 
is not likely to be too inappropriate, it may be more correct to obtain a 
modified number of degrees of freedom, along the lines suggested by 
Satterthwaite (47) for problems involving such linear combinations. 
What the modified number of degrees of freedom would be has not been 
investigated by us, and it may prove as easy to apply the exact chi-square 
procedure, indicated later, as to determine the appropriate degrees of 
freedom for the approximate chi square. 

It is of interest that a somewhat similar task of obtaining an appropriate 
summary chi square appears in the birth-order problems described by 
Halperin (48). There, it was necessary to compare a set of total observa- 
tions (across family sizes) with a set of total expectations, one for each 
birth order. Halperin described a matrix-inversion procedure for reducing 
the set of correlated deviations into a summary chi square. In that 
problem it can be shown that all the regression coefficients are equal in 
the multiple regression of the deviation at a particular birth order on the 
set of deviations at all succeeding birth orders. The second approximate 
method described previously for the present problem could thus be used 
exactly for the birth-order problem, permitting simplified computation of 
chi square. The procedure indicated by Halperin has the advantage of 
generality and could be applied to the current and related problems, if 
one obtained all the necessary variances and covariances and inverted 
the resulting matrix. 
Abstract 

A novel variation of the Haplotype Relative Risk (HRR) of 
Rubinstein et al. [Hum Immunol 1981;3:384] is proposed, in or- 
der to glean increased information about linkage disequilib- 
rium or allelic associations by analyzing haplotype-based data 
rather than genotypic data. It is shown that statistical tests 
based on our design give much higher power than those based 
on the original HRR approach. Several additional nonpara- 
metric tests based on the same data are analyzed, and power is 
computed for each of them. Further, parametric likelihood 
methods are applied to testing linkage equilibrium, and esti- 
mating δ, the coefficient of linkage disequilibrium, from the 
same data. 



Introduction 

Allelic associations between etiologically 
unrelated traits were originally detected in hu- 
mans through observations at the genotypic 
level. In the 1950s, it was noticed that in indi- 
viduals with certain diseases there were signif- 
icant excesses of certain blood groups. Aird et 
al. [1, 2] demonstrated the presence of a signif- 
icant association between blood group A and 
stomach cancer, and between blood group O 
and peptic ulcer, while Pike and Dickens [3] 
found such an association between blood 



group O and toxemia of pregnancy, and 
McConnell et al. [4] studied associations be- 
tween blood groups and carcinoma of the 
lung. Woolf [5] then proposed his Relative 
Risk statistic to compare the incidence rates in 
given blood groups in a case control type of 
study, in which one would collect a sample of 
people with the disease and compare the ob- 
served frequency of the 'risk allele' with its fre- 
quency in a separate sample of healthy indi- 
viduals (or population frequency, if known). 

One problem with this method is that there 
is no way of knowing whether a significant re- 
sult is biologically meaningful or just a conse- 
quence of having the case and control samples 
taken from different genetic populations in 
which the frequency of the risk allele is differ- 
ent and therefore, no real association exists. 
To attempt to circumvent this problem, Ru- 
binstein et al. [6] proposed the Haplotype Rel- 
ative Risk (HRR) statistic, based on earlier 
work of C.A.B. Smith, to ensure that the con- 
trol and disease samples were well-matched, 
from the same population, so that any ob- 
served association would have to be due to a 
real allelic association of some sort. This ex- 
perimental design has also been used in the 
haplotype frequency difference statistic of 
Seuchter et al. [7]. 

Experimental Design 

H = Marker allele with which disequilibrium is hy- 
pothesized. 
H̄ = Any allele other than H at the marker locus. 
δ = Gametic linkage disequilibrium coefficient; 
  = P(AB gamete) - P(A)P(B) (A at one locus, B 
at the other). 
θ = Recombination fraction between marker and 
disease loci, 
p = Gene frequency of the disease allele, 
q = Gene frequency of the H allele, 
n = Sample size. 

In order to be sure one has matched control and 
disease samples, Rubinstein et al. [6] proposed using 
data from nuclear families with one affected offspring 
to test for deviations from linkage equilibrium. They 
recommended using the affected offspring's genotype 
(made up of alleles transmitted from parents to the af- 
fected child) at a marker locus as the 'case' sample, and 
an artificial genotype made up of the alleles not trans- 
mitted to the child from its parents as the 'control' sam- 
ple in an association test. Then they used such data to 
test whether the H allele was present equally fre- 
quently in diseased individuals' genotypes, and the 
nontransmitted control genotypes. For example, in a 
family with unaffected parents with genotypes G/H 
and I/J at the marker locus, and an affected child with 
marker genotype H/I, the transmitted genotype would 
be H/I, and the artificial nontransmitted genotype 
would be G/J. Since they were only interested in 



Table 1. Data collected in a haplotype relative risk 
study (either HHRR, or GHRR) 

Transmitted     Not transmitted     Total 
                H         H̄ 
H               A         B         W 
H̄               C         D         X 
Total           Y         Z         N 



In the 2x2 table shown here, each cell corre- 
sponds to one parent. In the HHRR, each parent 
transmits one allele, and not the other, and can thus 
be classified by which allele was, and which was not 
transmitted to the affected offspring. In the GHRR, 
each set of parents has 4 alleles, 2 of which are trans- 
mitted to the affected child, and 2 which are not. If 
the child contains 1 or 2 H alleles, we say H was trans- 
mitted, and if there is an H allele in the remaining 2 
alleles, we say that H was nontransmitted. Thus, each 
family either transmits H or H̄, and has either H or H̄ 
among the nontransmitted alleles, and can therefore 
also be characterized by one cell of this table. 



Table 2. Haplotype relative risk 

                  H         H̄         Total 
Transmitted       W         X         N 
Not transmitted   Y         Z         N 
Total             W + Y     X + Z     2N 



The data in this table are taken directly from the 
marginals of table 1, and represent the form of the 
originally proposed GHRR statistic. This table, of 
course, can be filled with either haplotype- or geno- 
type-based data. All variable names are the same as 
in table 1. 



whether H was present or absent from the genotypes, 
in this example we have H transmitted, and H̄ not 
transmitted (genotype G/J does not contain H). For ev- 
ery such nuclear family there would be one such obser- 
vation. One can then tabulate such observations in the 
form of table 1. The example family above would fall in 
cell B. Ott [8] demonstrated that under the null hy- 
pothesis of δ = 0, the transmitted and nontransmitted 
Haplotype-Based HRR 

alleles are independently associated, and thus we can 
treat our transmitted and nontransmitted samples in- 
dependently and represent them in the form of table 2 
(marginals of table 1). Then a standard χ² test of inde- 
pendence on this table can be shown to be a valid χ² test 
of the hypothesis δ = 0. This is the test proposed by Ru- 
binstein et al. [6] to guarantee the control and disease 
samples are genetically well-matched. 

As is shown below, the statistical method of Rubin- 
stein et al. [6] does not take advantage of all the in- 
formation present in the data. Their method lumps 
H/H homozygotes and H/H̄ heterozygotes together as 
H genotypes. However, since under the null hypothesis 
the two parental genotypes are independent, it is pos- 
sible to treat each parent as an independent observa- 
tion, and merely look at the fate of each parental 
marker allele. So, in the example family above, there 
would be one observation of H transmitted, G not 
transmitted, and one observation of I transmitted, J 
not transmitted, which in table 1 (now referring to al- 
leles, not genotypes), would contribute one observa- 
tion to cell B, and one observation to cell D. Again, for 
theoretical reasons given by Ott [8], transmitted and 
nontransmitted alleles are independent of each other, 
and can be collapsed, as in the Rubinstein case, into ta- 
ble 2, in which the example family would contribute 
one observation to cell W, one to cell X, and two to cell 
Z, the marginal values of table 1. We are thus using 
more of the information present in the family, obtain- 
ing twice as many observations from the same amount 
of data. 
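
The allele-by-allele tabulation described above can be sketched as follows. The function names and the (transmitted, nontransmitted) tuple layout are assumptions made for illustration; the cell labels follow table 1 and table 2.

```python
from collections import Counter

def classify_parent(transmitted, nontransmitted, marker="H"):
    """Return the table 1 cell (A, B, C, or D) for one parent's pair of alleles."""
    t_is_h = transmitted == marker
    n_is_h = nontransmitted == marker
    if t_is_h and n_is_h:
        return "A"      # H transmitted, H not transmitted
    if t_is_h:
        return "B"      # H transmitted, non-H not transmitted
    if n_is_h:
        return "C"      # non-H transmitted, H not transmitted
    return "D"          # non-H transmitted, non-H not transmitted

def tabulate(parents, marker="H"):
    """Build table 1 cells and the table 2 marginals from per-parent allele pairs."""
    cells = Counter(classify_parent(t, n, marker) for t, n in parents)
    a, b, c, d = (cells[k] for k in "ABCD")
    table1 = {"A": a, "B": b, "C": c, "D": d}
    table2 = {"W": a + b, "X": c + d, "Y": a + c, "Z": b + d}
    return table1, table2

# Example family from the text: parents G/H and I/J, affected child H/I.
family = [("H", "G"), ("I", "J")]
table1, table2 = tabulate(family)
print(table1)  # {'A': 0, 'B': 1, 'C': 0, 'D': 1}
print(table2)  # {'W': 1, 'X': 1, 'Y': 0, 'Z': 2}
```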

Recessive Disease 

Haplotype-Based versus Genotype-Based 
HRR χ² Tests 

We first compared the power of our haplo- 
type-based HRR (HHRR) statistic with the 
genotype-based HRR (GHRR) of Rubinstein 
et al. [6]. The test we applied to each data set is 
essentially a χ² test of independence on table 2 
for the haplotype-based data (HHRR test), 
and for the equivalent genotype-based table 
(GHRR test) in which discrimination is be- 
tween genotypes with no H allele, and those 
with at least one (possibly two). Power calcula- 
tions were performed for each test, assuming a 
recessive disease with no phenocopies (pene- 
trance is irrelevant to the calculations, accord- 



ing to Ott [8]), by analytically computing the 
probability of a significant χ² test result (χ² 
> 3.84 at the 0.05 level) for different combina- 
tions of δ/p (δ and p are completely con- 
founded according to Ott [8]), q, and θ. Power 
curves for these two tests (n = 100 families, q = 
0.5) are given in figure 1 for varying true values 
of θ and δ/p. In all the numerical cases we 
considered, the HHRR test was more power- 
ful than the GHRR approach of Rubinstein et 
al. [6]. This is intuitively satisfying, since the 
HHRR approach discriminates between H/H 
homozygotes and H/H̄ heterozygotes, while 
the GHRR does not. Thus, our approach uses 
all of the information in the data, where the 
traditional GHRR does not. 

The test of independence on table 2 is a test 
of E[W] = E[Y]. However, W and Y are ob- 
tained from the marginals of table 1. So, when 
we are testing E[W] = E[Y], we are essentially 
testing E[A + B] = E[A + C], which is the same 
as E[B] = E[C]. Clearly this is expected under 
the null hypothesis of no disequilibrium. Us- 
ing the data from table 1, the HHRR χ² is com- 
puted as 

$$\chi^2 = \frac{2N(B-C)^2}{(2A+B+C)(2N-2A-B-C)} = \frac{2N(WZ-XY)^2}{(W+X)(W+Y)(X+Z)(Y+Z)},$$ 

the standard χ² test of independence on a 2 x 2 
table. This is a valid χ² test, of the form (B-C)²/ 
Var[B-C], since Var[B-C] = 2Nq(1-q), which 
is estimated by 2N[(2A+B+C)/(2N)][1-(2A 
+ B + C)/(2N)]. The power is shown graphi- 
cally in figure 2 for n = 50 families (for com- 
parison with other haplotype-based tests be- 
low). 
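
As a check on the algebra, the sketch below computes the HHRR χ² both from the table 1 cells and from the table 2 marginals. It assumes the second factor in the denominator is the number of non-H alleles among all 2N parental alleles, 2N - (2A + B + C), which is what the variance estimate 2Nq(1-q) implies. The cell counts are hypothetical.

```python
from math import isclose

def hhrr_chi2(a, b, c, d):
    """HHRR chi-square from table 1 cells (one observation per parent)."""
    n = a + b + c + d              # number of parents
    h = 2 * a + b + c              # H alleles among the 2N parental alleles
    return 2 * n * (b - c) ** 2 / (h * (2 * n - h))

def chi2_2x2(w, x, y, z):
    """Standard chi-square test of independence on the 2 x 2 table 2."""
    total = w + x + y + z
    return total * (w * z - x * y) ** 2 / ((w + x) * (w + y) * (x + z) * (y + z))

# Hypothetical counts for 100 parents.
a, b, c, d = 20, 40, 20, 20
w, x, y, z = a + b, c + d, a + c, b + d
print(hhrr_chi2(a, b, c, d))   # 8.0
print(chi2_2x2(w, x, y, z))    # 8.0
```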

McNemar Tests 

Since our null hypothesis is B = C in a 
paired sampling (transmitted allele, nontrans- 
mitted allele) test, one's first intuition might 
Fig. 1. Power curves (analyti- 
cally computed) for χ² tests based 
on the haplotype- ( — ) and geno- 
type-based ( ) HRR designs 
(100 families), for q = 0.5. If p = 0.5, 
then all values of δ/p shown are 
possible. For other values of p, dif- 
ferent restrictions apply, but have 
no effect on the power curve. The 
upper two lines are for the power of 
the test when θ = 0, and the lower 
set of two lines correspond to θ = 
0.20. Note that the haplotype-based 
design yields higher power for all 
true values of θ and δ/p. 
Fig. 2. Power curves (analyti- 
cally computed) for the HHRR test 
(50 families) for q = 0.5, with θ = 0 
(upper curve), 0.2 (middle curve) 
and 0.5 (lower curve). 
be to apply a McNemar test, (B-C)²/(B + C). 
In order for this to also be a valid χ² test, 
(B + C) would have to be an estimate of the 
variance of (B-C), which we already have 
shown to be 2Nq(1-q). Our HHRR χ² test uses 
all the data to estimate q, including the infor- 
mation from homozygous individuals, while in 
the McNemar test, all homozygotes are ig- 
nored, and the variance is estimated as 
(B + C). Clearly E[C] = E[B] = Nq(1-q) under 
the null hypothesis (δ = 0), so (B + C) then es- 
timates 2Nq(1-q). However, in every numer- 
ical case we considered, this test was less pow- 
erful than the HHRR test, as shown in figure 
3, due to the fact that the HHRR uses all of 
the data to estimate the variance, while the 
McNemar uses only the information from het- 
erozygous parents. 
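
The two statistics share the numerator (B-C)² and differ only in the variance estimate. A minimal sketch with hypothetical counts (function names are illustrative; the HHRR denominator is written as the estimate of 2Nq(1-q) from all 2N alleles):

```python
def mcnemar_chi2(b, c):
    """McNemar statistic: variance of (B - C) estimated by the discordant cells only."""
    return (b - c) ** 2 / (b + c)

def hhrr_chi2(a, b, c, d):
    """HHRR statistic: variance of (B - C) estimated as 2N * q_hat * (1 - q_hat)."""
    n = a + b + c + d
    q_hat = (2 * a + b + c) / (2 * n)       # uses homozygous parents as well
    return (b - c) ** 2 / (2 * n * q_hat * (1 - q_hat))

# Hypothetical counts for 100 parents.
a, b, c, d = 20, 40, 20, 20
print(round(mcnemar_chi2(b, c), 3))         # 6.667
print(round(hhrr_chi2(a, b, c, d), 3))      # 8.0
```

These particular counts are only an illustration; the power comparison in the text rests on the analytical calculations, not on any single table.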

Independence Tests 

An interesting result of Ott [8] is that trans- 
mitted and nontransmitted alleles are inde- 
Fig. 3. Power curves (analyti- 
cally computed) for the haplotype- 
based McNemar (HMCN) test (50 
families) for q = 0.5, with θ = 0 
(upper curve), 0.2 (middle curve), 
and 0.5 (lower curve). 

Fig. 4. Power curves (analyti- 
cally computed) for the haplotype- 
based independence test (HIND) 
for 50 families, q = 0.5, and θ = 0 
(lower curve), 0.2 (middle curve), 
and 0.5 (upper curve). 

pendent when δ = 0 or when θ = 0. In light of 
this, one could use an independence test on ta- 
ble 1 as a test of δ = 0, though clearly when θ is 
close to 0, this test should not be useful. This 
test is just that (AD-BC) = 0. Therefore, 
the test should be (AD-BC)²/Var(AD-BC), 
which is the standard χ² test of independence 
on a 2 x 2 table, N(AD-BC)²/(WXYZ). Power 
was analytically computed for this test, under 
the recessive model, for various true values of 
q, δ/p, and θ, which are graphically presented 



in figure 4. In this test, the power increases as 
θ increases, just the opposite behavior from 
the HHRR and McNemar tests. This test may 
thus be a useful way to use such nuclear family 
data to test δ = 0 when θ is known to be quite 
large, since when θ = 0.5, the HHRR tends to 
0 [8]. 
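
The independence statistic on table 1 can be sketched directly from the cell counts, with W, X, Y, and Z taken as the table 1 marginals. The function name and counts are hypothetical.

```python
def independence_chi2(a, b, c, d):
    """Chi-square test of independence on table 1: N(AD - BC)^2 / (WXYZ)."""
    n = a + b + c + d
    w, x = a + b, c + d      # row totals: H transmitted, non-H transmitted
    y, z = a + c, b + d      # column totals: H not transmitted, non-H not transmitted
    return n * (a * d - b * c) ** 2 / (w * x * y * z)

# Hypothetical counts for 100 parents.
print(independence_chi2(30, 20, 20, 30))  # 4.0
```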

This independence test, however, fails to 
impose the restriction that the frequency of 
the H allele be equal in both the transmitted 
and nontransmitted samples. To include this 
Fig. 5. Power curves (analyti- 
cally computed) for the test of fit to 
the expected multinomial propor- 
tions (HIID) of haplotype-based 
data for 50 families, q = 0.5, and 
θ = 0 (upper curve), 0.2 (middle 
curve), and 0.5 (lower curve). 
information, one could test the fit of the 
counts of A, B, C, and D to their expected mul- 
tinomial proportions (each observation is 
clearly independent) as follows: Σ(O-E)²/E, 
which is equal to 

$$\frac{(A-N\hat{q}^2)^2}{N\hat{q}^2} + \frac{[B-N\hat{q}(1-\hat{q})]^2}{N\hat{q}(1-\hat{q})} + \frac{[C-N\hat{q}(1-\hat{q})]^2}{N\hat{q}(1-\hat{q})} + \frac{[D-N(1-\hat{q})^2]^2}{N(1-\hat{q})^2}, \quad \text{where } \hat{q} = \frac{2A+B+C}{2N}.$$ 

This test follows a χ² distribution with 2 df, 
since we had 4 cell counts, but fixed the sum 
A + B + C + D = N, and estimated q from the 
data. This test is very powerful over a large 
range of values of δ/p, q, and θ, as shown in 
figure 5, and thus provides a useful general 
test for disequilibrium. 
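
A minimal sketch of this goodness-of-fit statistic, with q estimated from the data as (2A + B + C)/(2N). The function name and counts are hypothetical. The p-value line uses the fact that a χ² variable with 2 df has upper-tail probability exp(-x/2).

```python
from math import exp

def hiid_chi2(a, b, c, d):
    """Fit of the table 1 cells to their expected multinomial proportions (2 df)."""
    n = a + b + c + d
    q = (2 * a + b + c) / (2 * n)                    # estimated H allele frequency
    expected = [n * q * q, n * q * (1 - q), n * q * (1 - q), n * (1 - q) ** 2]
    observed = [a, b, c, d]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts for 100 parents.
stat = hiid_chi2(30, 20, 20, 30)
p_value = exp(-stat / 2)          # exact upper-tail probability for 2 df
print(stat)                       # 4.0
print(round(p_value, 4))          # 0.1353
```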

Relative Power of Nonparametric 
Approaches 

Each of the tests described above has dif- 
ferent properties which make it useful. How- 
ever, the question remains as to which test 
should be used in a given situation. To answer 
that question, for each combination of θ, δ/p, 




Fig. 6. Graph showing, for all possible values of θ 
and δ/p, and fixed q = 0.5, which among three tests is 
the most powerful (50 families). The values of the 
power are not shown, but are given in fig. 2-5 (HMCN 
is never the most powerful). 



and q, we determined which test gave maximal 
power for a sample size of 50 families. The re- 
sults are presented graphically in figure 6. In 
this figure, for fixed q, we considered all pos- 
sible combinations of δ/p and θ, and deter- 
mined which test gave maximal power (analyt- 
ically computed). Then for each point (δ/p, θ) 
the most powerful test is indicated. To see ex- 






Terwilliger/Ott 






actly what the power was, the reader is re- 
ferred to the power curves already presented 
for each test. Some interesting patterns can be 
seen in this figure, but it should be used only in 
conjunction with the actual values of the 
power shown in figures 2-5, for often the dif- 
ference is small between tests. However, over 
the most relevant ranges of δ/p and θ, for all q, 
the HHRR test is the most powerful. In light 
of this, and the relative implausibility of strong 
disequilibrium when θ is large, the HHRR 
test should be the general nonparametric test 
of choice, both for its power, and its simplicity. 

Parametric Likelihood Ratio Tests 
If one knows the model of the disease, one 
could do a parametric likelihood ratio test 
analysis, based on theoretical probabilities of 
each type of parent under a fixed model. Table 
2 of Ott [8] provides such parametric values 
for the case of a recessive disease. The diffi- 
culty here is threefold. First, one needs to 
have an accurate parametric model for the 
disease, and compute the parametric proba- 
bilities of each cell of table 1. This process is 
very tedious (except for the recessive model 
described by Ott [8]), and depends heavily on 
the disease model. Secondly, one needs to 
maximize the likelihood of the data over all 
the parameters, θ, (δ/p), and q, and then again 
maximize the likelihood, fixing δ = 0. This 
would give us the following likelihood ratio: 
L(δ/p, θ, q)/L(δ/p = 0, θ, q). Normally, one 
can treat 2 × ln(LR) as a χ² random variable, 
with the number of degrees of freedom being 
the difference in free parameters in numer- 
ator and denominator of the likelihood ratio, 
which would appear to be 1 in this case. How- 
ever, when δ = 0, θ disappears as a parameter, 
as shown by Ott [8]. When a parameter dis- 
appears under the null hypothesis, it is a de- 
generate situation, and so the statistic does 
not satisfy the criteria for χ². As the distribu- 
tion is unclear, this test becomes very awkward 



to interpret, and presents a situation analo- 
gous to the degenerate likelihood ratio test for 
linkage in the presence of heterogeneity [9]. 
For this reason, combined with the enormous 
computer time involved, power was not calcu- 
lated for this approach. 

For general pedigree data (including nu- 
clear families with multiple offspring), with a 
fixed-disease model, parametric likelihood ra- 
tio tests are tractable using any linkage analy- 
sis program, like ILINK of the LINKAGE 
package. One need only maximize the likeli- 
hood over θ, q, p, and δ for the numerator, 
and again maximize the likelihood for the de- 
nominator over θ, q, and p, fixing δ = 0. This 
would then be a valid, and powerful general li- 
kelihood ratio test of δ = 0, 2 × ln[L(θ, δ, p, q)/ 
L(θ, δ = 0, p, q)]. It is important to remember 
that when using this method, the maximum li- 
kelihood estimates of the haplotype frequen- 
cies will reflect the sample frequency of the 
disease allele, which is not an accurate reflec- 
tion of its population frequency. One must be 
sure to weight disease and control haplotypes 
accordingly. For example, if our haplotype fre- 
quency estimates are P(Hd), P(H̄d), P(HD), 
P(H̄D), and we know the true gene frequency 
of the d allele, $p_d$, we can compute adjusted 
haplotype frequency estimates as 

$$\hat{P}_{adj}(Hd) = p_d \left( \frac{\hat{P}(Hd)}{\hat{P}(Hd) + \hat{P}(\bar{H}d)} \right),$$ 

and so on. Similarly, if one wanted to estimate 
the coefficient of disequilibrium from such 
ILINK estimates, it would be necessary to use 
the adjusted estimates described above, yield- 
ing an adjusted estimate of 

$$\hat{\delta}_{adj} = \hat{\delta}\,\frac{p_d(1-p_d)}{\hat{p}_d(1-\hat{p}_d)},$$ 

where $\hat{\delta} = \hat{P}(Hd)\hat{P}(\bar{H}D) - \hat{P}(HD)\hat{P}(\bar{H}d)$, and 
$\hat{p}_d = \hat{P}(Hd) + \hat{P}(\bar{H}d)$. 
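
The reweighting can be sketched numerically. This is an illustration under an assumption, since the printed formulas are only partly legible: each disease-allele haplotype is rescaled by the ratio of the true to the sample frequency of d, and each normal-allele haplotype by the corresponding ratio for D. The function names, argument order, and input frequencies are hypothetical.

```python
from math import isclose

def adjust_haplotypes(p_Hd, p_hd, p_HD, p_hD, true_pd):
    """Rescale sample haplotype frequencies to a known frequency of allele d.

    Lower-case h stands for the non-H marker allele. Assumed scheme, not
    taken verbatim from the source.
    """
    sample_pd = p_Hd + p_hd                       # sample frequency of d
    scale_d = true_pd / sample_pd
    scale_D = (1 - true_pd) / (1 - sample_pd)
    return p_Hd * scale_d, p_hd * scale_d, p_HD * scale_D, p_hD * scale_D

def delta(p_Hd, p_hd, p_HD, p_hD):
    """Disequilibrium coefficient P(Hd)P(hD) - P(HD)P(hd)."""
    return p_Hd * p_hD - p_HD * p_hd

sample = (0.3, 0.2, 0.2, 0.3)                     # hypothetical sample estimates
adjusted = adjust_haplotypes(*sample, true_pd=0.1)
print([round(f, 4) for f in adjusted])            # [0.06, 0.04, 0.36, 0.54]
print(round(delta(*adjusted), 4))                 # 0.018
```

Under this scheme the adjusted coefficient equals the sample coefficient multiplied by p_d(1-p_d) divided by the sample value of the same product, which is what the last printed value reflects (0.05 × 0.09/0.25).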
An ad hoc method sometimes used in gen- 
eral pedigrees is to assume the absence of re- 
combination, and determine the haplotypes of 
each founder, between marker and disease, as 
a way to ensure the control (nondisease) ha- 
plotypes are from the same genetic population 
as the disease haplotypes. This ad hoc ap- 
proach has been applied, for example, in cystic 
fibrosis [10]. It assumes an absence of recom- 
bination, and its statistical properties are, in 
general, unclear, especially in cases where θ is 
actually greater than zero. Another problem is 
that it is not always possible to uniquely and 
accurately determine all founder haplotypes. 
Censoring such indiscernible cases in some in- 
stances can be shown to lead to a statistical 
bias. In light of all of this, if one wants to use 
general pedigree data to test and quantify dis- 
equilibrium, the likelihood ratio test with 
ILINK described above is the test of choice, as 
it is more general and powerful, and has well- 
characterized statistical properties. 

Nonrecessive Case 

All of our results above were obtained for 
the case of a recessive disease. However, when 
other more complicated models prevail, the 
situation becomes unclear. While under any 
model we choose for the disease, the above 
tests are valid tests of δ = 0 (since this implies 
no association between the disease and the 
marker locus), the effect on the power of our 
testing procedures is not so clear. When deal- 
ing with a recessive disease, a lot of additional 
information about linkage disequilibrium is 
obtained by looking at each parent separately, 
since each parent transmits a disease allele to 
the affected offspring, but the situation is less 
clear when there is a different model. For a 
dominant disease, with one affected parent, 
and one affected child, one can just consider 
the affected parent, and his or her transmitted 



and nontransmitted alleles, and base a test on 
the same procedure as above. The effect 
would be that there would be only one obser- 
vation per family instead of two in the reces- 
sive case (where we know the parents to be 
heterozygous for the disease), and there is 
possible noise when the unaffected parent ac- 
tually transmits the disease to the offspring, 
though this should be very rare. 

In the case of dominant reduced-pene- 
trance disease, in which neither parent is af- 
fected, clearly at least one parent must carry 
the disease-predisposing allele, though we 
cannot discern which one. In this situation, 
one parent will transmit the disease allele (in 
putative disequilibrium with the marker), and 
the other parent will transmit the normal al- 
lele. This adds noise to our system. One would 
expect the Rubinstein method to be less sensi- 
tive to this noise, since it doesn't distinguish 
between heterozygotes and homozygotes for 
the H allele. 

Fig. 7. Power curves (simulated) for the HHRR ( — ) and GHRR ( ) tests with a 
dominant disease (reduced penetrance) and two unaffected parents, for q = 0.5, 
p = 0.01, and 100 families, based on 20,000 replicates. The upper curves 
represent θ = 0, and the lower curves θ = 0.2. In most cases, the HHRR is shown 
to be much more powerful than the GHRR. 

Power calculations were approximated for 
this situation by simulation. A simplified 
model was considered in which one parent was 
forced to transmit the disease allele to the 
affected child, while the other parent was 
assumed to be homozygous unaffected (a 
reasonable assumption for small p). In this case, δ 
and p are no longer completely confounded, 
so we had to treat p, q, δ, and θ as separate 
parameters. Then, 20,000 sets of 100 such nuclear 
families with 2 unaffected parents and one 
affected offspring were simulated under various 
assumptions on p, q, δ, and θ. For each set of 
100 families, the HHRR and GHRR were 
calculated. Then the number of significant 
results for each test at the 0.05 level (χ² ≥ 3.84) 
was counted to estimate the power of each 
test, which is graphed in figure 7. An interesting 
situation arises here, where the HHRR is 
much more powerful for negative values of δ, 
but for positive values of δ they are just about 
equal in power, with the GHRR being slightly 
more powerful for very extreme values of δ. 
The HHRR test is also more powerful than 
the other haplotype-based nonparametric 
tests over most of the reasonable sample 
space. The HHRR is more powerful than the 
GHRR in all recessive situations, dominant 
situations with δ < 0, and about equally powerful 
with the GHRR in dominant situations 
with extremely positive δ. Further, the HHRR 
can take advantage of dominant situations 
with one affected parent, while the GHRR 
cannot. Therefore, we recommend using the 
HHRR as the nonparametric test of choice in 
general. 
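
The simulation-based power estimate described above can be sketched in Python. This is only an illustration under simplifying assumptions that are not taken from the paper: instead of simulating families from p, q, δ, and θ, each replicate draws the marker allele independently among the transmitted and the nontransmitted parental alleles with two fixed probabilities (`prob_transmitted` and `prob_nontransmitted`, both hypothetical parameters), and the statistic is a plain Pearson chi-square on the resulting 2 × 2 allele table. All function names are illustrative.

```python
import random


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 d.f., no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # a margin is empty, so no evidence of association
    return n * (a * d - b * c) ** 2 / denominator


def estimate_power(prob_transmitted, prob_nontransmitted, n_families=100,
                   n_replicates=2000, critical_value=3.84, seed=1):
    """Fraction of simulated data sets whose chi-square reaches the critical value."""
    rng = random.Random(seed)
    n_alleles = 2 * n_families  # one transmitted allele per parent (assumption)
    significant = 0
    for _ in range(n_replicates):
        # Count marker-allele carriers among transmitted and nontransmitted alleles.
        t = sum(rng.random() < prob_transmitted for _ in range(n_alleles))
        u = sum(rng.random() < prob_nontransmitted for _ in range(n_alleles))
        if chi_square_2x2(t, n_alleles - t, u, n_alleles - u) >= critical_value:
            significant += 1
    return significant / n_replicates


if __name__ == "__main__":
    print("rejection rate with no association:", estimate_power(0.5, 0.5))
    print("estimated power with association:  ", estimate_power(0.65, 0.5))
```

With equal probabilities the rejection rate should sit near the nominal 0.05 level, which is a quick check that the critical value is being applied correctly.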



Discussion 

When doing an association study, it is often 
difficult to find genetically well-matched cases 
and control samples. The HRR approach of 
using transmitted and nontransmitted alleles 
from the same parent as case and control sam- 
ples ensures that they are genetically well- 
matched [11]. Further, the case and control 
samples are shown to be independent under 
the null hypothesis of δ = 0. In light of this, 
HRR-type methods should be increasingly 
important as geneticists try to map complex 
diseases, by looking for associations with 
candidate genes for example. In such a case, if 
the candidate gene is correct, θ would be 
equal to 0, and these methods would achieve 
maximal power to detect the associations. 
Further, the built-in genetic control should 
provide a solution to the often difficult task of 
finding a valid control sample, and should al- 
low people to have more faith in the validity of 
such association studies. 

The approach presented here extracts fur- 
ther information about disequilibrium from 
the data used in the original GHRR approach, 
and thus presents a more powerful way to de- 
tect such associations in the absence of a para- 
metric model. Given a parametric model, two 
likelihood-based methods were discussed as 
well. However, from the results of our power 
calculations, our HHRR seems to be the best 
general nonparametric test considered for de- 
tecting such associations with this experimen- 
tal design over the most biologically plausible 
ranges of δ and θ. 


Haplotype relative risks: an easy reliable way to construct a proper 
control sample for risk calculations 

C. T. FALK and P. RUBINSTEIN 

The Lindsley F. Kimball Research Institute of The New York Blood Center, 310 E. 67th St., 
New York, NY 10021 

SUMMARY 

An alternative to Woolf's (1955) relative risk (RR) statistic is proposed for use in calculating 
the risk of disease in the presence of particular antigens or phenotypes. This alternative uses, 
as the control sample, the parental antigens or haplotypes not present in the affected child. The 
formulation of a haplotype relative risk (HRR) thus eliminates the problems of sampling from 
the same homogeneous population to form both the disease sample and an appropriate control. 

We show that, in families selected through a single affected individual, where transmission 
of the four parental haplotypes can be followed unambiguously, the mathematical expectation 
of the HRR is identical to that of the RR. Since the sample formed from the 'non-affected' 
parental haplotypes is clearly from the same population as the disease sample, the HRR thus 
provides a reliable alternative to the RR. A further advantage obtains when family data are 
being collected as part of a study since the control sample is then automatically contained in 
the family material. 

Data from studies of patients with insulin dependent diabetes mellitus (IDDM) are used to 
obtain an estimate of the risk to those with HLA antigens or phenotypes associated with IDDM 
using the HRR statistic. A comparison of the HRR's and RR's for these data is also presented. 

INTRODUCTION 

Relative risks have been used for some time to estimate the increased risk of contracting a 
disease, given that a certain condition (or trait) is present, over that of the group lacking the 
condition. This formal definition of a relative risk requires prospective information that is not 
easily obtained and the relative risk is often approximated by the more easily obtained cross 
product 

\[ \frac{\Pr(Q \mid \text{aff})\,\Pr(q \mid \text{control})}{\Pr(q \mid \text{aff})\,\Pr(Q \mid \text{control})}, \]

where Q stands for the presence of the condition or trait and q for the lack of the condition, 
and the four terms are conditional probabilities as indicated. When the overall frequency of 
the disease in a population is low, this estimate will closely approximate the true relative risk. 
This odds ratio was proposed by Woolf (1955) to estimate the risk of contracting either peptic 
ulcers or stomach cancer for individuals of particular ABO phenotypes. Since then it has been 
used to calculate risks for genetic markers associated with many diseases and its most notable 
use has been in studying several HLA-associated diseases such as insulin dependent diabetes 
mellitus (IDDM), coeliac disease, multiple sclerosis and ankylosing spondylitis. Several assumptions 
are generally made about the underlying population from which both the disease sample 
and the control sample are obtained, most importantly that both samples are drawn from the 
same genetically homogeneous population in an unbiased way. By this we mean that the disease 
sample should be selected on a clear-cut ascertainment criterion, e.g. randomly chosen affected 
individuals with no bias pertaining to other factors, and the control sample should be a strictly 
random sample from the same genetic population. In practice, this latter criterion is rather 
difficult to fulfil and most often the control is created from conveniently available data drawn 
from a population thought to be somewhat closely related to that from which the disease sample 
was drawn. 
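
The claim that the cross product approximates the true relative risk when the disease is rare can be illustrated with a short Python sketch. It assumes, for illustration only, that the control sample consists of unaffected individuals from the same population; the risk values and trait frequency are hypothetical, and the function names are not from the paper.

```python
def relative_risk(risk_with_trait, risk_without_trait):
    """True relative risk: Pr(disease | Q) / Pr(disease | q)."""
    return risk_with_trait / risk_without_trait


def cross_product_ratio(risk_with_trait, risk_without_trait, trait_freq):
    """Cross product Pr(Q|aff) Pr(q|control) / [Pr(q|aff) Pr(Q|control)].

    Controls are taken to be unaffected individuals (an assumption of this sketch).
    """
    aff_Q = trait_freq * risk_with_trait
    aff_q = (1 - trait_freq) * risk_without_trait
    unaff_Q = trait_freq * (1 - risk_with_trait)
    unaff_q = (1 - trait_freq) * (1 - risk_without_trait)
    pr_Q_aff = aff_Q / (aff_Q + aff_q)
    pr_Q_control = unaff_Q / (unaff_Q + unaff_q)
    return (pr_Q_aff * (1 - pr_Q_control)) / ((1 - pr_Q_aff) * pr_Q_control)


if __name__ == "__main__":
    # Rare disease: the two measures nearly coincide.
    print(relative_risk(0.004, 0.001), cross_product_ratio(0.004, 0.001, 0.2))
    # Common disease: the cross product overstates the relative risk.
    print(relative_risk(0.4, 0.1), cross_product_ratio(0.4, 0.1, 0.2))
```

The second pair of values shows why the rarity condition matters: the same fourfold relative risk yields a noticeably larger cross product when the disease is common.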

Several years ago we proposed (Rubinstein et al. 1981) an alternative method for obtaining 
the control sample for relative risk (RR) estimations that eliminated the problems of sampling 
from a single homogeneous population. This method used, as a control, those parental 
haplotypes not present in the affected child and was therefore called the haplotype relative risk 
(HRR). This method has several appealing features including freedom from collection of proper 
control samples. Additionally, where families are to be studied anyway, collection of the family 
data automatically includes collection of the necessary control sample. It is, however, necessary 
to demonstrate that the HRR estimate has the appropriate characteristics. In this paper we 
will show that, assuming the 'ideal' conditions inherent in the definition of RR, namely, control 
and disease samples both randomly chosen from the same homogeneous random mating 
population, the expected value of the HRR is identical to that of the conventional RR. We 
will then illustrate its use in the estimation of risks for HLA antigens and phenotypes associated 
with IDDM. 

THE MODEL 

Consider a set of families that has been ascertained through a single affected child, where the 
relevant disease locus is closely linked to a normal polymorphic genetic marker (e.g. HLA) and 
where certain alleles (antigens) are associated with the disease. For purposes of concreteness, 
we will assume that the disease is recessively inherited, although the same arguments hold for 
dominance and for other inheritance models as well. Assume that the HLA haplotypes present 
in the parents can be followed unambiguously in transmission to the offspring and designate 
the two inherited by the affected child as 'a' (paternal) and 'c' (maternal). Thus haplotypes 
a and c are assumed to carry the disease allele, say 'n'. In the special case where the child 
as well as both parents are ac, it is not certain whether the child gets the a from the mother 
or the father. However, it is still known that one a and one c haplotype were transmitted to 
the affected child, and thus carry the n allele, and that the haplotypes not passed on to the 
affected child were also a and c. The latter can therefore be included in the 'random sample' 
as described below. Now if we have truly obtained our sample as a random, singly selected 
sample, the two parental haplotypes not transmitted to the affected child (say b and d) will 
represent a random sample of haplotypes from the population at large and will thus carry the 
disease allele (n) or the normal allele (N) with probabilities equal to the allele frequencies in 
the population (say p_1 and p_2, respectively, p_1 + p_2 = 1). The validity of this observation requires 
compliance with certain other assumptions including (1) that the parents are not inbred, (2) 
that there is no correlation within or between parental phenotypes and (3) that there is no 
differential fertility of the disease phenotypes. 

Now assume that an antigen, Q, at the HLA locus is in positive linkage disequilibrium with 
n, the disease allele. We wish to calculate the relative risk to carriers of Q of contracting the 
disease. We will use as our control population the set of 'b' and 'd' haplotypes from our sample 
of disease families (that is, those haplotypes within a family not carried by the single affected 
proband). Using this control we will then calculate the conventional cross product odds ratio 
given above to obtain the haplotype relative risk (HRR). Define the relevant population 
frequencies as follows : 

\begin{align*}
f(Q) &= q_1, \\
f(q) &= q_2 = 1 - q_1 \quad \text{(where $q$ represents all other alleles)}, \\
f(n) &= p_1, \\
f(N) &= p_2 = 1 - p_1, \\
f(Qn) &= x_1 = p_1 q_1 + \delta, \\
f(QN) &= x_2 = p_2 q_1 - \delta, \\
f(qn) &= x_3 = p_1 q_2 - \delta, \\
f(qN) &= x_4 = p_2 q_2 + \delta,
\end{align*}

where δ is the measure of disequilibrium between n and Q. 

We now need the four conditional probabilities necessary for the odds ratio. For the affected 
sample these are the same, regardless of how we choose our control. 

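\[ \Pr(Q \mid \text{aff}) = \frac{p_1^2 - x_3^2}{p_1^2}, \]

\[ \Pr(\text{not } Q \mid \text{aff}) = \frac{x_3^2}{p_1^2}. \]

Now since the control haplotypes will be a random sample from the population, the conditional 
probabilities will be: 

\[ \Pr(Q \mid \text{control}) = 1 - (x_3 + x_4)^2 = 1 - q_2^2, \]

\[ \Pr(\text{not } Q \mid \text{control}) = (x_3 + x_4)^2 = q_2^2. \]

Thus the estimate of the HRR is: 

\[ \frac{\Pr(Q \mid \text{aff})\,\Pr(\text{not } Q \mid \text{control})}{\Pr(\text{not } Q \mid \text{aff})\,\Pr(Q \mid \text{control})}, \]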
which is identical to the equivalent expression for the conventional RR. 
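
As a numerical illustration of the expression above, the following Python sketch evaluates the expected HRR for the recessive model. It assumes that an affected child carries the disease allele on both haplotypes, so that Pr(not Q | aff) = (x_3 / p_1)^2, and that a control 'individual' lacks Q with probability q_2^2; the parameter values are hypothetical and the function name is illustrative.

```python
def expected_hrr(p1, q1, delta):
    """Expected cross-product ratio for antigen Q under the recessive model.

    p1    : frequency of the disease allele n
    q1    : frequency of the antigen Q
    delta : disequilibrium between n and Q
    """
    q2 = 1.0 - q1
    x3 = p1 * q2 - delta                   # f(qn): disease haplotypes lacking Q
    pr_not_q_aff = (x3 / p1) ** 2          # neither disease haplotype carries Q
    pr_q_aff = 1.0 - pr_not_q_aff
    pr_not_q_control = q2 ** 2             # two random haplotypes, both lacking Q
    pr_q_control = 1.0 - pr_not_q_control
    return (pr_q_aff * pr_not_q_control) / (pr_not_q_aff * pr_q_control)


if __name__ == "__main__":
    print(expected_hrr(p1=0.1, q1=0.2, delta=0.0))   # no disequilibrium
    print(expected_hrr(p1=0.1, q1=0.2, delta=0.01))  # positive disequilibrium
```

With δ = 0 the ratio reduces to 1, since Q then carries no information about the disease allele; any positive δ pushes it above 1.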



EXAMPLE 

Using data collected for the 9th HLA Workshop (Bertrams & Baur, 1984) we looked at the 
sample of families, submitted for study, where a single child was affected with IDDM and where 
the ethnic background was caucasoid (Western European or North American). The patients 

Table 1. DR phenotypes of IDDM disease sample, simplex cases 

DR type     No. obs.    No. exp. 
DR3, 3          6          7.8 
DR3, 4         25         18.2 
DR4, 4          4         10.7 
DR3, X         16         19.1 
DR4, X         29         22.4 
DRX, X         10         11.7 
Total          90         89.9 

p(DR3) = 0.294; p(DR4) = 0.344; p(DRX) = 0.361; α/β = (0.278)/(0.202) = 1.38. 

Table 2. DR phenotypes of control sample consisting of non-affected parental haplotypes 

DR type     No. obs.    No. exp. 
DR3, 3          0          0.77 
DR3, 4          2          1.23 
DR4, 4          0          0.49 
DR3, X         13         12.26 
DR4, X         10          9.76 
DRX, X         48         48.49 
Total          73         73.00 

p(DR3) = 0.103; p(DR4) = 0.082; p(DRX) = 0.815; χ² = 1.79, 2 d.f. 

were categorized with respect to their HLA DR phenotypes using three distinct allelic groups, 
DR3, DR4, and DRX, where DRX represents all other DR antigens except DR3 and DR4. The 
results are shown in Table 1 with estimated allele frequencies and observed and 'Hardy-Weinberg 
expected' numbers for each phenotypic class. The α/β ratio of Falk et al. (1983) was 
also calculated and found to be 1.38. This ratio relates the observed frequency (α) of, say, the 
DR3, 4 phenotype to the Hardy-Weinberg expected frequency [β = 2p(DR3)p(DR4)] in a 
sample of diseased individuals (Table 1). A value in excess of 1.0 is an indication that the 
associated susceptibility locus does not show a simple dominant or recessive mode of inheritance 
with a single susceptibility allele. The value of 1.38 found here is characteristic of samples of 
IDDM individuals, where an excess of DR3, 4's is often observed, thus suggesting a more complex 
mode of inheritance for susceptibility (Falk, 1984). The 'control group' was made up of the 
parental haplotype pairs not present in the affected child (only families in which all four HLA 
haplotypes could be followed were used). There were 146 parental control haplotypes. The allele 
frequencies for DR3, DR4, and DRX in this group were 0.103, 0.082, and 0.815 respectively. 
These values agree remarkably well with the total frequencies obtained for the 'random mating 
population' comprising all caucasoid random individuals submitted to the 9th HLA Workshop 
(Baur et al. 1984) (see, e.g., the table on page 694, where the DR marginal frequencies are 0.122, 
0.129, and 0.749 for the same three DR alleles). If the control haplotypes from each family 
are assumed to be a 'control individual', we obtain a control population sample of 73 which 
is in H-W equilibrium (χ² = 1.79, 2 d.f., see Table 2). 
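
The allele frequencies, Hardy-Weinberg expected numbers, and α/β ratio quoted above can be recomputed from the observed phenotype counts. The Python sketch below does so by allele counting; the function names are illustrative, and small differences from the printed values are to be expected because the paper appears to work from rounded frequencies.

```python
def allele_frequencies(genotype_counts):
    """Allele frequencies by counting both alleles of every genotype."""
    total_alleles = 2 * sum(genotype_counts.values())
    counts = {}
    for (a1, a2), n in genotype_counts.items():
        counts[a1] = counts.get(a1, 0) + n
        counts[a2] = counts.get(a2, 0) + n
    return {allele: c / total_alleles for allele, c in counts.items()}


def hardy_weinberg_expected(genotype_counts):
    """Expected genotype numbers under Hardy-Weinberg proportions."""
    n = sum(genotype_counts.values())
    p = allele_frequencies(genotype_counts)
    return {(a1, a2): n * (1 if a1 == a2 else 2) * p[a1] * p[a2]
            for (a1, a2) in genotype_counts}


def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over genotype classes."""
    return sum((observed[g] - expected[g]) ** 2 / expected[g] for g in observed)


# Observed counts from Table 1 (disease sample) and Table 2 (parental control).
disease = {("DR3", "DR3"): 6, ("DR3", "DR4"): 25, ("DR4", "DR4"): 4,
           ("DR3", "DRX"): 16, ("DR4", "DRX"): 29, ("DRX", "DRX"): 10}
control = {("DR3", "DR3"): 0, ("DR3", "DR4"): 2, ("DR4", "DR4"): 0,
           ("DR3", "DRX"): 13, ("DR4", "DRX"): 10, ("DRX", "DRX"): 48}

p_disease = allele_frequencies(disease)
alpha = disease[("DR3", "DR4")] / sum(disease.values())  # observed DR3, 4 frequency
beta = 2 * p_disease["DR3"] * p_disease["DR4"]           # H-W expected frequency
alpha_over_beta = alpha / beta
control_chi_square = chi_square(control, hardy_weinberg_expected(control))

if __name__ == "__main__":
    print(p_disease, alpha_over_beta, control_chi_square)
```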

In Table 3, we compare the HRR's for DR3 and DR4 to the RR's calculated using a 
'contrived control population' from the 9th HLA Workshop population data referred to above. 
This 'population' is assumed to be in H-W equilibrium and our 'random sample' is of the same 

Table 3. HRR's and RR's for the DR3 and DR4 antigens in a sample of simplex IDDM patients 

(The control for the HRR's is the sample of parental haplotypes not present in the affected individuals. 
The control for the RR's was obtained by 'creating' a H-W sample assuming the antigen frequencies 
recorded for the 9th HLA workshop (Baur et al. 1984).) 

HRR 

DR3            +      -    Total 
Disease       47     43      90 
Control       15     58      73 
Total         62    101     163 
HRR = 4.23; p = 2.6 × 10^-5 

DR4            +      -    Total 
Disease       58     32      90 
Control       12     61      73 
Total         70     93     163 
HRR = 9.21; p = 7.6 × 10^-10 

RR 

DR3            +      -    Total 
Disease       47     43      90 
Control       21     69      90 
Total         68    112     180 
RR = 3.59; p = 5.3 × 10^-5 

DR4            +      -    Total 
Disease       58     32      90 
Control       22     68      90 
Total         80    100     180 
RR = 5.60; p = 6.8 × 10^-9 

Table 4. HRR's and RR's for the DR3, 3, DR3, 4 and DR4, 4 phenotypes 

(Samples are the same as those described in Table 3. In each case comparison is made relative to the 
'base group' DRX, X to avoid the problems of non-independent risk estimates.) 

DR type     Disease sample    Parental control    Workshop control 
DR3, 3             6                  0                  1.3 
DR3, 4            25                  2                  2.8 
DR4, 4             4                  0                  1.5 
DR3, X            16                 13                 16.4 
DR4, X            29                 10                 17.4 
DRX, X            10                 48                 50.5 
Total             90                 73                 89.9 

HRR                          RR 
HRR(3, 4) = 60.0             RR(3, 4) = 45.1 
HRR(3, 3) = undefined        RR(3, 3) = 23.3 
HRR(4, 4) = undefined        RR(4, 4) = 13.5 

If 'expected values' are substituted for the zero observations in the parental control, one gets: 

HRR'(3, 3) = 37.4, 
HRR'(4, 4) = 39.2. 



size as our disease sample (i.e. 90 individuals). Table 4 gives HRR's and RR's for the three DR 
phenotypes DR3, 3, DR3, 4, and DR4, 4 using the same samples. Here the risks are compared 
to the baseline phenotype DRX, X in each case since the risks are not independent (cf. 
Curie-Cohen, 1981, Svejgaard & Ryder, 1981). Note that the HRR's for DR3, 3 and DR4, 4 
are undefined since there are no 'individuals' with those phenotypes in the control sample of 
73. If expected values are substituted for the 'zero' values in those cases HRR's can be estimated 
as given at the bottom of Table 4, but the use of such estimates must be made with caution. 
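
The risk estimates in Tables 3 and 4 are cross-product ratios of 2 × 2 counts, which the following Python sketch recomputes. The function name is illustrative; returning `None` for a zero denominator is a choice made here to mirror the 'undefined' entries in Table 4, not something prescribed by the paper. No significance test is included, since the text does not state which test produced the p-values.

```python
def cross_product_ratio(case_pos, case_neg, control_pos, control_neg):
    """(case_pos * control_neg) / (case_neg * control_pos), or None if undefined."""
    denominator = case_neg * control_pos
    if denominator == 0:
        return None
    return (case_pos * control_neg) / denominator


# Table 3: antigen present (+) or absent (-) in disease and control samples.
hrr_dr3 = cross_product_ratio(47, 43, 15, 58)
hrr_dr4 = cross_product_ratio(58, 32, 12, 61)
rr_dr3 = cross_product_ratio(47, 43, 21, 69)
rr_dr4 = cross_product_ratio(58, 32, 22, 68)

# Table 4: each phenotype compared with the DRX, X base group.
hrr_34 = cross_product_ratio(25, 10, 2, 48)
hrr_33 = cross_product_ratio(6, 10, 0, 48)           # no DR3, 3 in the parental control
hrr_33_expected = cross_product_ratio(6, 10, 0.77, 48)  # zero replaced by expected value

if __name__ == "__main__":
    for name, value in [("HRR DR3", hrr_dr3), ("HRR DR4", hrr_dr4),
                        ("RR DR3", rr_dr3), ("RR DR4", rr_dr4),
                        ("HRR(3, 4)", hrr_34), ("HRR(3, 3)", hrr_33),
                        ("HRR'(3, 3)", hrr_33_expected)]:
        print(name, value)
```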

DISCUSSION 

One of the major problems inherent in proper calculations of relative risks (RR's) is that of 
choosing an appropriate control. A basic assumption in the use of RR's is that both the affected 
sample and the control sample are chosen at random from the same genetically homogeneous 
random mating population with no selection criteria except for the disease status required for 
inclusion in the affected sample. In practice this is a difficult criterion to fulfil. Additionally, 
it adds a significant amount of work to select and test such a control sample. It is therefore 
often assumed that the control sample is simply a hypothetical sample created from a population 
thought to be similar to that of the disease sample and 'generated' from that population by 
assuming H-W equilibrium and some reasonable sample size (cf. Svejgaard & Ryder, 1981, and 
our 'contrived' sample of the previous section). 

Given the known heterogeneity of current urban populations, even within the less hetero- 
geneous European countries, use of population control data culled, for example, from HLA 
workshop surveys, may alter the significance of calculated RR's. Although, in the examples 
given here the results are significant for both RR's and HRR's (Table 3), the 'p-values' for 
significance differ by two-fold (for DR3) and 100-fold (for DR4), with the HRR's being more 
significant in each case. If less extreme samples were tested, careless choice of the control group 
could very well make the difference between statistical significance and non-significance 
(resulting in either a type I or a type II error). 

Methods have previously been proposed for using sibship information to calculate 'risks'. For 
example, Clarke (1961) describes a method, attributed to C. A. B. Smith, for using sibships to 
test for a significant risk of duodenal ulcers to individuals of blood group O. The method used 
is somewhat different from that described here in that an observed and expected probability 
of being group O is assigned to the propositus in each sibship where the expected value depends 
on the makeup of the sibship. The significance is then based on a comparison of pooled observed 
and expected values over a set of sibships. This method does overcome the problem of 
heterogeneity but, because of the way the test is constructed, only a small part of the data can 
be used. In Clarke's example, therefore, the associations found when using the general 
population as a control were very much decreased when using Smith's sibship method. This does 
not seem to be the case using HRR's where the associations remain strong. 

By using the two parental haplotypes not present in the single diseased individuals of the 
disease sample as the 'control sample', we are assured of having both samples from the same 
genetic population and, as was demonstrated above, this sample should represent a random 
sample of haplotype pairs (or 'individuals') from that population. Care must still be taken to 
ensure that the population chosen is genetically homogeneous, to the extent possible, but the 
task of obtaining an appropriate control is simplified. 

If the disease is dominant rather than recessive, the HRR can still be used in the same way. 
Although it is not known whether the disease allele is present on the paternal haplotype ('a') 
or the maternal ('c') or perhaps on both, the other two parental haplotypes, b and d, will still 
represent random haplotypes from the underlying population, provided that the conditions 
mentioned for the recessive case obtain. 

If a family is selected through more than one affected child, the situation is somewhat 
different. If the two affected sibs share the same two HLA haplotypes then the other two should 
still represent random haplotypes from the population. However, if they share fewer than two 
haplotypes, the situation is more complicated. Now three (or possibly four) haplotypes are 
known to carry the disease allele in the recessive case. If the disease is dominant, it is possible, 
but not certain, that a single shared haplotype carries the disease allele. If no haplotype is 
shared, it is not possible to define disease-carrying haplotypes with certainty. In such cases it 
would therefore be difficult to define a control sample of random haplotypes meeting the 
necessary criteria. 

Two other points should be emphasized. If there is differential selection between genotypes 
at the susceptibility locus, (e.g. reduced fertility) a bias might be introduced such that the 
control haplotypes could no longer be considered a random population sample. Thus we require 
compliance with assumption (3) of our model to ensure the proper distribution of susceptibility 
alleles in the 'control' haplotypes. 

Further, if the population from which the sample is drawn is genetically heterogeneous with 
respect to the disease, the HRR as well as the RR may be difficult to interpret as well as to 
use. In an extreme case a population might be made up of two ethnically distinct subpopulations 
that do not interbreed. Assume that the disease of interest occurs in only one of two such 
subpopulations. An estimate of the HRR would come entirely from a sample taken from the 
subpopulation where the disease is present and would be relevant only to that population 
(individuals in the other group having no risk, by definition). On the other hand, the RR would 
assign a risk over the entire population that would be too low for individuals in the susceptible 
part of the population and too high for individuals in the non-susceptible part. 

We wish to thank Drs Jurg Ott, Neil Risch and C. A. B. Smith for helpful and constructive comments on 
an earlier draft of this paper. 
This work was supported by NIH grant GM29177. 